Abstract

We study online prediction where regret of the algorithm is measured against a benchmark defined 
via evolving constraints. This framework captures online prediction on graphs, as well as other prediction 
problems with combinatorial structure. A key aspect here is that finding the optimal benchmark predictor 
(even in hindsight, given all the data) might be computationally hard due to the combinatorial nature of 
the constraints. Despite this, we provide polynomial-time prediction algorithms that achieve low regret 
against combinatorial benchmark sets. We do so by building improper learning algorithms based on two 
ideas that work together. The first is to alleviate part of the computational burden through random 
playout, and the second is to employ Lasserre semidefinite hierarchies to approximate the resulting 
integer program. Interestingly, for our prediction algorithms, we only need to compute the values of the 
semidefinite programs and not the rounded solutions. However, the integrality gap for Lasserre hierarchy 
does enter the generic regret bound in terms of Rademacher complexity of the benchmark set. This 
establishes a trade-off between the computation time and the regret bound of the algorithm. 


1 Introduction 

To motivate the general setting of the paper, let us start with an example. Consider the problem of node 
label prediction in an evolving social network. At each round, a new user joins the network and makes 
connections to some existing users. The observable part of a user’s type is represented by a covariate vector 
(or, side information) that may consist of gender, age, education level, and other revealed characteristics. 
Suppose we are tasked with developing a system that predicts a “label” for the user, in a possible set of 
outcomes. For instance, our goal might be to conduct a successful marketing campaign; here, the unseen 
labels could stand for the type of product the user will buy. Having made the prediction, we observe the 
actual behavior of the person (such as a purchase) and suffer a cost if the prediction was wrong. 

We would like to devise a framework for developing prediction algorithms for this problem. Several 
aspects require careful consideration. First, how do we phrase the goal of the forecaster? Second, how do we 
model the evolution of the graph, arrival of users, and users’ covariate vectors? Third, how can we leverage 
global information dispersed in the network in order to make good predictions on the individual level? Last 
but not least, how do we develop computationally feasible prediction methods? 

To make matters concrete, consider an example where at each time step $t$ a new user joins the network,
and the links (edges) to other users are revealed along with side information $x_t$ about the user. We may think
of the weights $W_{ij} \in [-1,1]$ as the strength of similarity (dissimilarity) between users $i$ and $j$. This number
is only known if $i,j \le t$. The system makes a binary prediction $\hat{y}_t$, and the actual label $y_t$ of the user is
subsequently revealed. For developing such a prediction system, the practitioner would need to incorporate
prior knowledge about the problem. For instance, it might be reasonable to assume that at the end of $V$
rounds, the nodes of the graph will be roughly clustered in terms of their labels, with within-community
links being mostly positive and across-community links being mostly negative. In addition to this adherence
of labels to the graph structure, we also encode prior information through a function class $\mathcal{F}$ of mappings
from side-information to labels. For instance, in binary classification it might be reasonable to suspect a
linear separation between the two classes in terms of $\mathrm{sign}(w^\top x_t)$ for some $w$. Unfortunately, the connectivity,
side information, and the labels are only partially known until the end of $V$ rounds. Nevertheless, we set
the goal as that of predicting as well as if this information were available: the performance is measured by
the regret

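$$\sum_{t=1}^{V} \mathbf{1}\{\hat{y}_t \ne y_t\} - \inf_{f \in \mathcal{F}[\mathrm{data}]} \sum_{t=1}^{V} \mathbf{1}\{f(x_t) \ne y_t\},$$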

where $\mathcal{F}[\mathrm{data}] \subseteq \mathcal{F}$ is only known at the end of $V$ rounds (precise definition given in the next section).
$\mathcal{F}[\mathrm{data}]$ is a data-dependent set of labelings that (we hope) models well the prediction problem at hand (see
[CBGVZ13] and references therein for related graph prediction problems).

Given the interpretation that positive $W_{ij}$'s encode similarity and negative $W_{ij}$'s encode dissimilarity, it
is natural to let $\mathcal{F}$ be a set of labelings such that the number of disagreements at endpoints is minimized
for edges with positive weights and maximized for edges with negative weights. This smoothness of $f \in \mathcal{F}$
with respect to the graph can be encoded by the graph Laplacian $L$, and one can use
$\mathcal{F} = \{f \in \{\pm 1\}^{V} : f^\top L f \le K\}$, for some parameter $K > 0$ [RS14]. The authors of the latter paper proposed a straightforward
relaxation to obtain a computationally feasible method, at the expense of having a larger regret bound. This
is a starting point for the present paper.
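
To make the smoothness condition $f^\top L f \le K$ concrete, the following Python sketch evaluates the quadratic form edge by edge. It assumes the standard Laplacian $L = D - W$ with $D_{ii} = \sum_j W_{ij}$, for which $f^\top L f = \sum_{i<j} W_{ij}(f_i - f_j)^2$; the dictionary encoding of the weights and the function names are illustrative choices and are not taken from [RS14].

```python
def laplacian_quadratic_form(weights, f):
    """Return f^T L f for L = D - W, using the identity
    f^T L f = sum over edges (i, j) of W_ij * (f_i - f_j)^2."""
    return sum(w * (f[i] - f[j]) ** 2 for (i, j), w in weights.items())


def is_smooth(weights, f, K):
    """Check membership of the labeling f in {f : f^T L f <= K}."""
    return laplacian_quadratic_form(weights, f) <= K


# Hypothetical signed graph: two "similar" edges and one "dissimilar" edge.
weights = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): -1.0}
agrees_with_graph = [1, 1, 1, -1]   # disagrees only across the negative edge
fights_the_graph = [1, -1, 1, -1]   # also disagrees across both positive edges
```

With $\pm 1$ labels every disagreeing edge contributes $4 W_{ij}$, so a labeling that splits only negative edges has a smaller quadratic form than one that splits positive edges.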

We depart from the usual regret minimization framework in several ways. First, instead of restricting
the set of possible labelings based solely on the graph structure and edge weights, we model the set $\mathcal{F}$
through the number of satisfied constraints. To this extent, a graph structure is just a particular set of
constraints that involve pairs of nodes (which we shall interchangeably call “items” or “individuals”). A
more general constraint might involve groups of individuals, and this gives greater flexibility in modeling the
overall interaction between the nodes. Formally, a constraint is an arbitrary binary or real-valued function
from assignments of labels for a subset of nodes to $\mathbb{R}_{\ge 0}$. Within theoretical computer science, constraint
satisfaction problems (CSPs) are a natural umbrella for such combinatorial problems as Max Cut, Unique
Games, and Max $k$-SAT. Furthermore, under the Unique Games Conjecture, semidefinite relaxations
provide an optimal approximation ratio for every CSP [Rag08, RS09]. One of the goals of this paper is to
apply semidefinite relaxation techniques to the problem of online prediction with combinatorial constraints.

The second way in which we depart from the traditional work on online learning is in allowing constraints
to be revealed in an online manner. For the example of graph-based constraints, this means that the graph
can be revealed to the forecaster sequentially. Moreover, we can think of the graph as evolving in time since 
identities of the nodes have little significance, except for being arguments to constraints. We assume that 
the probability distribution that governs this evolution is known to the forecaster. As a particular case, the 
distribution may put all the mass on the revelation of all the constraints at the first round, in which case the 
constraints (or, the graph) are “known ahead of time.” More generally, one may take graph evolution models 
studied in probability theory and in social networks research, and use these for the prediction problem. In 
addition to the evolution of constraints, we allow the forecaster to observe side information about the new 
node. This side information is, once again, stochastic and follows a distribution jointly with constraints and 
node identities. 

While the constraints and side information are stochastic, the label is chosen in an adversarial way. We 
have in mind the situation where we can model the network structure and the distribution of people types, 
but the label (or, action) of the person is not easily modeled. Instead, this behavior can be best understood 
through global information within the network, not the local information. Such a global coherence of labels 
and the constraints is modeled through the comparator class $\mathcal{F}$.

It would appear that the overall framework involving constraints, side information, and adversarially 
chosen labels cannot yield computationally tractable algorithms. Yet we show that by moving to improper 
prediction algorithms one can develop computationally efficient methods for the problem with only slight 
worsening of the regret guarantees. As a first step towards developing efficient methods, we show that 
the knowledge of the overall distribution governing the presentation of constraints and the side information 
allows us to define a randomized method with a provable guarantee on prediction error. We analyze “random 
playout,” a method that simulates future constraints and side information and uses these hallucinated values 
in place of missing information. We show that such an algorithm (which arises from the relaxation framework
in [RSS12]) has regret that is bounded by classical Rademacher complexity of $\mathcal{F}$ given the constraints and
side information.

The last missing piece in this story is how to calculate the next prediction given the random playout. 
Here, we show that the forecaster needs to compute a value with conditional Rademacher complexity as 
part of the objective. In general, the computation of Rademacher complexity is not a feasible task for the 
types of combinatorial constraints we have in mind. However, the online relaxation framework suggests that 
we may take a superset of $\mathcal{F}$ (given the constraints and side information) and suffer regret of Rademacher
complexity of this larger set. We propose to use semidefinite hierarchies for this task. In particular, we
use the Lasserre hierarchy [Las01, Par03] to obtain polynomial-time prediction methods with a “knob” (level
of the hierarchy) that trades off computational time and prediction performance as measured by the regret. 

In this paper, two distinct uses of the word “relaxation” come together. Online relaxations are upper 
bounds on the minimax value of the multistage prediction problem [RSS12]. One of a number of approaches 
for obtaining online relaxations is to increase the set of benchmark solutions. The latter is a relaxation in 
the sense of optimization, as we show in the paper. Indeed, in this case, online relaxations and optimization 
relaxations are put on the same footing, and any distinction between the two should be clear from the 
context. 

We use semidefinite relaxations in a somewhat unconventional way because the end goal is the problem 
of prediction. The online relaxation requires us to compute the value of the relaxed objective rather than the 
integer solution. Sidestepping the need to round the solution is a nice feature of “improper” prediction
methods. The integrality gap still comes into the picture, as it effectively quantifies the increase of Rademacher
complexity for the larger set. Yet, the regret bound only requires existence of a rounding procedure with a 
given guarantee and not its implementation. Crucially, the multiplicative increase due to the integrality gap 
is a constant that enters the regret bound only, leaving the constant in front of the comparator (OPT) to 
be one! The way in which the power of semidefinite relaxations fuses with the power of online relaxations is 
rather fortuitous. 

The statements proved in this paper have an interesting “modularity” property. As soon as one finds a 
rounding procedure with a smaller integrality gap, this gap can be immediately inserted in the regret upper 
bound of our method. The prediction algorithm itself does not change, as it does not need to round the 
solution. Further, since Lasserre hierarchies we are employing are known to be tighter than LP-based and 
other hierarchies, the integrality gap can be proved for these weaker approximation methods. 

We remark that it has been noted in the literature by various authors that the problem of prediction 
can be solved in situations when the offline solution is NP-hard (see e.g. [HKS12, Chr14, Abe10]). Our
work can be seen as formally extending this statement to approximation schemes, with an additional knob 
for the computation-prediction tradeoff. We also remark that ideas similar in spirit have been proposed in 
[CRPW12, CJ13], among others, in the statistical (rather than online) setting. In particular, the recent 
paper of [BM15] gives very strong guarantees for learning third-order tensors using the 6th level of the 
sum-of-squares hierarchy. The authors compute a tight bound on the Rademacher complexity of the relaxed 
norm. 

In summary, our contribution involves a framework for online prediction of labels for individuals that 
appear in a streaming fashion, with side information about individuals and constraints being also revealed 
in an online manner. The labels themselves can be adversarially chosen, while we assume that the stochastic 
model of the constraints and side information is known a priori. We propose a general method that is based 
on random playout, and further propose a semidefinite relaxation for the resulting CSP-like problem. We 
prove several regret bounds for the prediction method in terms of integrality gaps. The method allows for a 
trade-off between computation time and performance guarantee. 

This paper is organized as follows. After describing the setting in the next section, we present in Section 3 
the formalism of online relaxations and state a generic random-playout algorithm with a regret guarantee in 
terms of the expected relaxation. In Section 4 we show that the relaxation based on classical Rademacher 
averages is “admissible”, and we state the computationally-difficult problem. In Section 5 we relax the 
problem in the SDP language of Lasserre hierarchy. Section 6 makes the connection between the integrality 
gap and the regret bound of the $r$-th level in the hierarchy. The main result here is Theorem 3 which gives
a regret bound in terms of the Rademacher complexity and the integrality gap. We turn to an alternative
“Lagrangian” form of the optimization problem in Section 7 and prove a regret bound for the $r$-th level of this
form of relaxation (Theorem 5). Several examples are discussed in Section 8, and the paper is concluded with
a lower bound in Section 9 which shows near-optimality of our methods in terms of prediction performance. 

Notation We use the following shorthand notation: let $[n] = \{1,\ldots,n\}$, $a_{1:t} = (a_1,\ldots,a_t)$, and
$(a,b)_{1:t} = (a_1, b_1,\ldots,a_t,b_t)$. We denote by $\Delta(A)$ the set of distributions on the set $A$.


2 Setting 


On each round $t = 1,\ldots,V$, the forecaster observes a new item along with side information $x_t \in \mathcal{X}_t \subseteq \mathcal{X}$ and
a set $\mathcal{C}_t$ of constraints. The forecaster then makes a prediction $\hat{y}_t \in \{1,\ldots,\kappa\} = [\kappa]$ and observes the label
$y_t \in [\kappa]$. The side information set $\mathcal{X}_t$ may be time-varying, but is known to the forecaster. Each constraint
$c \in \mathcal{C}_t$ is represented by a pair $(S_c, R_c)$ where $S_c \subseteq V$ and $R_c : [\kappa]^{S_c} \to \mathbb{R}_{\ge 0}$. For an assignment $g \in [\kappa]^{V}$,
we write $c(g)$ or $R_c(g)$ for the value of $R_c$ on $g(S_c)$. To lighten the notation, let us introduce a shorthand
$\mathcal{I}_t = (\mathcal{C}_t, x_t)$ for the associated constraints and the side information for the item.

Example 1. Let $\kappa = 2$ and let $g \in \{1,2\}^{V}$ be an assignment of binary labels to vertices of an unweighted
graph $G = (V,E)$. Define a constraint $c$ for each edge $(u,v) \in E$ by taking $S_c = (u,v)$ and $R_c(g_u, g_v) =
\mathbf{1}\{g_u \ne g_v\}$. Any labeling $g$ defines a partition of $G$, and the size of the cut is precisely the sum of $c(g)$ over the edge constraints.
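
As a concrete reading of Example 1, the sketch below encodes each edge as a pair $(S_c, R_c)$ and sums the constraint values for a given labeling. The helper names `make_cut_constraints` and `total_violation` are hypothetical and chosen only for illustration.

```python
def make_cut_constraints(edges):
    """One constraint per edge: S_c = (u, v) and R_c(g_u, g_v) = 1{g_u != g_v}."""
    return [((u, v), lambda gu, gv: int(gu != gv)) for (u, v) in edges]


def total_violation(constraints, g):
    """Sum of c(g) over all constraints; c(g) evaluates R_c on g restricted to S_c."""
    return sum(R(*(g[v] for v in S)) for S, R in constraints)


edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
g = {0: 1, 1: 1, 2: 2, 3: 2}  # labels in {1, 2}
cut_size = total_violation(make_cut_constraints(edges), g)
```

Because a constraint is just a scope plus a function on that scope, the same `total_violation` routine would also handle constraints over more than two nodes.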

Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [\kappa]$. Each $f \in \mathcal{F}$ gives rise to a vector $(f(x_1),\ldots,f(x_V))$ of labelings
of the items. Given $x_1,\ldots,x_V$, each $f \in \mathcal{F}$ induces an assignment vector $[f(x_j)]_{j=1}^{V} \in [\kappa]^{V}$, and now
$c([f(x_j)]_{j=1}^{V})$ represents the value of the constraint $c$ on this assignment.

Let $\cup\mathcal{C}_t = \bigcup_{t=1}^{V} \mathcal{C}_t$ denote the union of all the constraint sets. Given this union, as well as $x_{1:V}$, we define
the subset of those functions that do not violate more than $K$ constraints as

$$\mathcal{F}_K[\mathcal{I}_{1:V}] = \Big\{ f \in \mathcal{F} : \sum_{c \in \cup\mathcal{C}_t} c\big([f(x_1),\ldots,f(x_V)]\big) \le K \Big\} \tag{1}$$

for some given $K > 0$.

Example 2. Continuing with Example 1, let $\mathcal{F} = \{f(x) = \mathbf{1}\{\langle w, x\rangle > \gamma\} + 1 : w \in \mathbb{R}^d\}$. The set in (1) is
then the set of homogeneous hyperplanes that classify the vertices of the graph with a margin $\gamma$ in such a way
that the cut is at most of size $K$.

Let $\ell(\hat{y}_t, y_t) = \mathbf{1}\{\hat{y}_t \ne y_t\}$ be the indicator loss function. The goal of the forecaster is phrased as
minimization of regret

$$\mathrm{Reg} = \sum_{t=1}^{V} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}_K[\mathcal{I}_{1:V}]} \sum_{t=1}^{V} \ell(f(x_t), y_t) \tag{2}$$

with respect to the (data-dependent) subset of $\mathcal{F}$. This definition forces the forecaster to perform nearly as
well as the benchmark that satisfies the constraints up to a certain threshold.
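
For small instances the regret in (2) can be evaluated by brute force, which may help fix ideas. The sketch below assumes the class of all static labelings (one label per item) and cut constraints on a path graph; the function names and the toy data are hypothetical, and the enumeration is exponential in the number of items, so it is only an illustration of the definition.

```python
from itertools import product


def total_violation(constraints, g):
    """Sum of R_c evaluated on the labels that g assigns to the scope S_c."""
    return sum(R(*(g[v] for v in S)) for S, R in constraints)


def regret(predictions, labels, labelings, constraints, K):
    """Indicator-loss regret against labelings that violate at most K constraints."""
    feasible = [g for g in labelings if total_violation(constraints, g) <= K]
    if not feasible:
        raise ValueError("no labeling satisfies the violation budget K")
    learner_loss = sum(p != y for p, y in zip(predictions, labels))
    best_loss = min(sum(gt != y for gt, y in zip(g, labels)) for g in feasible)
    return learner_loss - best_loss


# Path graph 0-1-2-3 with one cut constraint per edge; labelings in {1, 2}^4.
constraints = [((u, u + 1), lambda a, b: int(a != b)) for u in range(3)]
labelings = list(product((1, 2), repeat=4))
value = regret([1, 2, 2, 2], [1, 1, 2, 1], labelings, constraints, K=1)
```

Here the true labels switch twice along the path, so no labeling with at most one cut edge is perfect, and the benchmark itself incurs one mistake.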

We remark that the class $\mathcal{F}$ is “pruned” as more information about the constraints arrives over time.
This pruning in effect captures the global information in the network, which requires adherence of labelings
(given locally by values of $f$ on the side information) to the global structure of constraints. It is important
to recognize that the forecaster faces a difficulty: the “pruned” set (1) of comparators can only be calculated
in hindsight.

We assume that the constraints and side information are drawn from a distribution known to the
forecaster. That is, given $\mathcal{I}_{1:t-1}$, we assume that the forecaster is able to draw samples from the conditional
distributions

$$p(\mathcal{C}_t, x_t \mid \mathcal{I}_{1:t-1}). \tag{3}$$
Example 3 (Preferential Attachment). In the preferential attachment model, the set of constraints $\mathcal{C}_t$
corresponds to a set of new edges connected to previously revealed nodes. The edges are drawn according
to the node degree given by the set of edges $\mathcal{C}_{1:t-1}$. In this example, the distribution does not depend on
side-information.
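
A sampler for the conditional distribution in Example 3 might look as follows. This is only a sketch under stated assumptions: the number of new edges `m`, the "+1" smoothing of degrees (so that isolated nodes can still be chosen), and the function names are not part of the model described above.

```python
import random


def sample_new_edges(t, edges_so_far, m, rng):
    """Draw the constraint set for node t: up to m edges to earlier nodes,
    each target chosen with probability proportional to (degree + 1)."""
    if t == 0:
        return []
    weight = [1] * t  # +1 smoothing is an assumption of this sketch
    for u, v in edges_so_far:
        weight[u] += 1
        weight[v] += 1
    targets = set()
    while len(targets) < min(m, t):
        targets.add(rng.choices(range(t), weights=weight)[0])
    return [(u, t) for u in sorted(targets)]


def grow_graph(num_nodes, m, rng):
    """Reveal nodes one at a time, accumulating the sampled edges."""
    edges = []
    for t in range(num_nodes):
        edges.extend(sample_new_edges(t, edges, m, rng))
    return edges


edges = grow_graph(6, m=2, rng=random.Random(0))
```

Calling `sample_new_edges` repeatedly from the current round onward is one way to "hallucinate" the future constraint sets needed by a playout-style method.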

Example 4 (Geometric Random Graphs). We may allow $x_t$'s to be drawn from some fixed distribution that
does not depend on the constraints. In turn, the constraints can be formed according to the side information.
One example is a geometric random graph, where pairwise constraints (graph edges) are formed according
to distances from the new random point which may be given by the distance between the side information
vectors. It is known that such graphs have better spectral properties [BHHS11]. The results in this paper indeed
employ an average (rather than the worst-case) integrality gap and can take advantage of “nice” graphs.
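
One possible instance of Example 4 is sketched below. The uniform distribution on the unit square, the hard distance threshold `radius`, and the function name are assumptions made for illustration; the text above only requires that edges be formed according to distances between side information vectors.

```python
import math
import random


def sample_geometric_step(t, points, radius, rng):
    """Draw x_t uniformly from the unit square, then connect node t to every
    earlier node whose side information lies within `radius` of x_t."""
    x_t = (rng.random(), rng.random())
    new_edges = [(s, t) for s, x_s in enumerate(points)
                 if math.dist(x_s, x_t) <= radius]
    return x_t, new_edges


rng = random.Random(0)
points, edges = [], []
for t in range(5):
    x_t, new_edges = sample_geometric_step(t, points, radius=0.5, rng=rng)
    points.append(x_t)
    edges.extend(new_edges)
```

In this variant the side information is drawn first and the constraints are a deterministic function of it, which matches the order of dependence described in the example.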

Example 5 (Unlabeled Data). Rather than assuming the knowledge of the distribution of $x_t$'s, the random
play-out algorithm introduced in the paper may tap into a pool of unlabeled data.

Other examples of distributions include a variant of the stochastic block model (SBM). This generative 
process provides the simplest model of group formation (though we remark that we are not aiming to recover
a hidden labeling, which is the focus of much research on SBM). 

The upper bounds on regret obtained in this paper will also hold for an intermediate time horizon n < V. 
This “anytime” property follows from the fact that constraints are only added, and not deleted. If one is 
only concerned with regret at time V, the deletion is easy to incorporate in the model. 

Finally, let us mention that much of prior literature on online prediction on graphs requires the knowledge
of the graph from the beginning. When the order in which nodes are presented is given to us in advance, the
problem is readily modeled by our setting via $\mathcal{X}_t = \{t\}$. We then write $f(x_t) = f(t)$, precisely the notation
for a static expert [CBL06]. On the other hand, the case when nodes are presented to us in adversarial
fashion is not directly modeled by the presented setting. However, the algorithms presented here can be
easily extended to such a scenario. Indeed, at every round $t$, we simply pick some fixed order for the remaining
unseen nodes and make predictions assuming this is the order in which nodes will be presented. Along the
lines of the inductive proof in [CBS11], we can show that the algorithm enjoys the same regret against an
adversarial ordering of nodes as it would in the case when the order is known in advance.

In summary, we presented a flexible problem definition that models the arrival of items and the evolution 
of constraints. The model encapsulates local information about the items. The goal of the forecaster is 
phrased as a global measure of coherence given all the information at the end of the day. The rest of the 
paper is focused on exhibiting randomized methods that provably minimize regret in this general framework. 
We also focus on the computational issues associated with making predictions. 


3 Online Relaxations 

The idea of online relaxations was studied in [RSS12] as a generic recipe for deriving prediction algorithms. 
The basic technique for our context is as follows. Consider for a moment the problem that does not involve 
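constraints, and suppose $x_1,\ldots,x_V$ are provided to the forecaster ahead of time. At time $t$, the forecaster
predicts $\hat{y}_t \in \mathcal{Y}$ and observes $y_t \in \mathcal{Y}$. Furthermore, suppose the comparator set $\mathcal{G}$ of functions $\mathcal{X} \to \mathcal{Y}$ in
the regret definition is fixed. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, an online relaxation $\mathrm{Rel}$ is a sequence of
functions that satisfies two conditions. First is the dominance condition: for any sequence of instances $x_{1:V}$
and $y_{1:V}$,

$$\mathrm{Rel}(\mathcal{G} \mid y_{1:V}) \ge -\inf_{g \in \mathcal{G}} \sum_{t=1}^{V} \ell(g(x_t), y_t). \tag{4}$$

Second is the recursive condition: for any $t \in [V]$,

$$\inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \Big\{ \mathbb{E}_{\hat{y}_t \sim q_t}\big[\ell(\hat{y}_t, y_t)\big] + \mathrm{Rel}(\mathcal{G} \mid y_{1:t}) \Big\} \le \mathrm{Rel}(\mathcal{G} \mid y_{1:t-1}). \tag{5}$$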
A relaxation that satisfies these conditions is termed admissible. Given a relaxation $\mathrm{Rel}$ for a class $\mathcal{G}$, define
an online learning algorithm which at time $t$, given instances $y_{1:t-1}$ and $x_{1:V}$, makes the random prediction
$\hat{y}_t$ by drawing from the distribution $q_t \in \Delta(\mathcal{Y})$ either given by

$$q_t = \operatorname*{argmin}_{q \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \Big\{ \mathbb{E}_{\hat{y}_t \sim q}\big[\ell(\hat{y}_t, y_t)\big] + \mathrm{Rel}(\mathcal{G} \mid y_{1:t}) \Big\},$$

or by any other choice that ensures admissibility of the relaxation. It can be easily shown that regret of such
a strategy is upper bounded (in expectation and with high probability) by $\mathbb{E}[\mathrm{Rel}(\mathcal{G} \mid \emptyset)]$.

We now turn to the case of side-information and constraints being revealed to the forecaster sequentially.
We would like to “lift” the admissibility technique to this situation. To start, assume that we have a
relaxation that is admissible for any class $\mathcal{G} = \mathcal{F}_K[\mathcal{I}_{1:V}]$. We propose the following simple randomized
strategy.

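At time $t$, given $\mathcal{I}_{1:t} = (\mathcal{C}_s, x_s)_{s=1}^{t}$, draw $\mathcal{I}_{t+1:V} = (\mathcal{C}, x)_{t+1:V}$ from the known distribution $p$. Pick
distribution $q_t$ over $\mathcal{Y}$ as follows

$$q_t(\mathcal{I}_{t+1:V}) = \operatorname*{argmin}_{q \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \Big\{ \mathbb{E}_{\hat{y}_t \sim q}\big[\ell(\hat{y}_t, y_t)\big] + \mathrm{Rel}(\mathcal{F}_K[\mathcal{I}_{1:V}] \mid y_{1:t}) \Big\} \tag{6}$$

and make a randomized prediction according to $q_t(\mathcal{I}_{t+1:V})$.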

As mentioned in the introduction, the above randomized method is of a “random playout” style. The
forecaster simulates future draws to solve the (otherwise difficult) problem in expectation. The next lemma
guarantees a bound on the expected regret in terms of expected Rademacher complexity of the data-dependent
class. The upper bound behaves as if the forecaster were able to integrate over the complete distribution $p$
on each round, despite the fact that the method only draws one sample.
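
The control flow of a single playout round can be sketched as follows. This is a structural illustration only: `sample_future` and `solve_q` are hypothetical callables standing in for a draw from $p$ and for the minimization in (6), and the toy stand-ins at the bottom are assumptions that carry no meaning beyond making the sketch runnable.

```python
import random


def random_playout_round(revealed, past_labels, sample_future, solve_q, rng):
    """One round: hallucinate the future once, solve for q_t, sample a label."""
    future = sample_future(revealed, rng)        # stand-in for a draw of I_{t+1:V}
    q = solve_q(revealed + future, past_labels)  # stand-in for the argmin in (6)
    labels = range(1, len(q) + 1)
    return rng.choices(labels, weights=q)[0]


# Toy stand-ins (assumptions): items are plain integers and q is hard-coded.
def toy_future(revealed, rng):
    return [rng.randint(0, 9) for _ in range(3)]


def toy_solver(all_items, past_labels):
    return [0.0, 1.0]  # puts all mass on label 2


prediction = random_playout_round([4, 7], [1], toy_future, toy_solver,
                                  random.Random(0))
```

Note that only one future is sampled per round, in line with the remark that the method draws a single sample rather than integrating over $p$.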

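Lemma 1. Suppose $\mathrm{Rel}$ is an admissible relaxation for any $\mathcal{F}_K[\mathcal{I}_{1:V}]$. Then the randomized algorithm given
in (6) enjoys the performance guarantee

$$\mathbb{E}[\mathrm{Reg}] \le \mathbb{E}_{(\mathcal{C},x)_{1:V}}\big[\mathrm{Rel}(\mathcal{F}_K[\mathcal{I}_{1:V}] \mid \emptyset)\big].$$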

The proof of this lemma is postponed to the appendix. We refer to [RSS12] for more details of the 
technique. 

Of course, the question remains: how do we come up with admissible relaxations required by Lemma 1?
This is the subject of the next section.


4 Rademacher-Based Relaxations 


The previous section presented a generic randomized prediction algorithm when the forecaster can sample 
from the distribution p that generates the constraint sets and the side information. In this section, we provide 
a specific form of the relaxation we can use, along with the corresponding regret bound. The forecaster will 
be required to solve $\kappa$ optimization problems per round to obtain the randomized prediction for that round.

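Let $\mathcal{M}$ be a set of $V \times \kappa$ matrices such that for any $M \in \mathcal{M}$, every $t \in [V]$ and $k \in [\kappa]$, $M_{t,k} \in [0,1]$
and $\sum_{k=1}^{\kappa} M_{t,k} \le 1$. Given any class $\mathcal{G}$ of functions $\mathcal{X} \to [\kappa]$ and side information $x_{1:V}$, we define a set of
matrices $\mathcal{M}_{\mathcal{G}}$ as

$$\mathcal{M}_{\mathcal{G}} = \big\{ M^f : f \in \mathcal{G},\ M^f_{t,k} = \mathbf{1}\{f(x_t) = k\} \big\}.$$

If $\kappa = 2$, each $M^f$ can be simply represented by a vector of binary labels that $f$ assigns to $x_1,\ldots,x_V$.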

Lemma 2. For any class $\mathcal{G}$ of predictors, if $\mathcal{M}_{\mathcal{G}} \subseteq \mathcal{M}$, then the following relaxation is admissible for
prediction with respect to class $\mathcal{G}$:

$$\mathrm{Rel}(\mathcal{G} \mid y_{1:t}) = \mathbb{E}_{\epsilon_{t+1:V}}\left[\sup_{M \in \mathcal{M}} \left\{ 2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + \sum_{s=1}^{t} M_{s,y_s} \right\}\right] - t.$$

Here, each $\epsilon_j$ is a vector of independent Rademacher random variables and $\epsilon_{j,k}$ stands for the $k$th coordinate
of this vector. Further, the randomized strategy corresponding to the above relaxation is given by first drawing
Rademacher vectors $\epsilon_{t+1:V}$ and then predicting $\hat{y}_t$ according to

$$q_t(\epsilon_{t+1:V}) = \operatorname*{argmin}_{q \in \Delta([\kappa])} \sup_{y_t \in [\kappa]} \left\{ 1 - q[y_t] + \sup_{M \in \mathcal{M}} \left\{ 2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + \sum_{s=1}^{t} M_{s,y_s} \right\} - t \right\}.$$

Recall that Lemma 1 provides a generic randomized strategy that, at round $t$, generates the future
instances $\mathcal{I}_{t+1:V}$ and then uses as a black box an admissible relaxation for function classes $\mathcal{F}_K[\mathcal{I}_{1:V}]$. By
combining Lemma 2 and Lemma 1, we get the following randomized prediction strategy:

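At time $t$, given side information $x_{1:t}$, constraint sets $\mathcal{C}_{1:t}$ and past labels $y_{1:t-1}$, draw $\mathcal{I}_{t+1:V}$ from $p$. Next,
draw Rademacher vectors $\epsilon_{t+1:V}$ and compute, for each $o \in [\kappa]$, the value

$$R_t(o) = \sup_{M \in \mathcal{M}(\mathcal{I}_{1:V})} \left\{ 2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + M_{t,o} + \sum_{s=1}^{t-1} M_{s,y_s} \right\} \tag{7}$$

where $\mathcal{M}(\mathcal{I}_{1:V})$ is some set of matrices such that $\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]} \subseteq \mathcal{M}(\mathcal{I}_{1:V})$. We then solve for the randomized
strategy $q_t(\epsilon_{t+1:V})$ given by

$$q_t(\epsilon_{t+1:V}) = \operatorname*{argmin}_{q \in \Delta_\kappa} \max_{o \in [\kappa]} \big\{ 1 - q[o] + R_t(o) \big\}. \tag{8}$$

Finally, predict $\hat{y}_t$ by simply drawing it from $q_t(\epsilon_{t+1:V})$.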


Note that the step of solving for $q_t(\epsilon_{t+1:V})$ can be done efficiently by first sorting $R_t(1),\ldots,R_t(\kappa)$ in
descending order and then using a simple water filling argument to find $q_t(\epsilon_{t+1:V})$.
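
One way to carry out the sort-and-water-fill step for (8) is sketched below. Minimizing $\max_o \{1 - q[o] + R_t(o)\}$ over the simplex amounts to finding a level $\tau$ with $\sum_o \max(R_t(o) - \tau, 0) = 1$ and setting $q[o] = \max(R_t(o) - \tau, 0)$; the optimal value is then $1 + \tau$. The function name and the return convention are choices made for this illustration.

```python
def water_filling(R):
    """Solve argmin over the simplex of max_o {1 - q[o] + R[o]}.

    Returns (q, value). Scans the values of R in descending order and keeps
    the level tau from the longest prefix whose smallest entry exceeds it."""
    order = sorted(range(len(R)), key=lambda o: R[o], reverse=True)
    prefix, tau = 0.0, None
    for m, o in enumerate(order, start=1):
        prefix += R[o]
        candidate = (prefix - 1.0) / m  # level if exactly the top m labels get mass
        if R[o] > candidate:
            tau = candidate
    q = [max(r - tau, 0.0) for r in R]
    return q, 1.0 + tau


q, value = water_filling([0.5, 0.2, 0.0])
```

Labels with a large $R_t(o)$ receive more probability mass, and labels whose value falls below the level $\tau$ receive none.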

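For the algorithm outlined above, in view of Lemma 1, the expected regret is upper-bounded as:

$$\mathbb{E}[\mathrm{Reg}] \le 2\, \mathbb{E}_{(\mathcal{C},x)_{1:V}} \mathbb{E}_{\epsilon_{1:V}} \left[ \sup_{M \in \mathcal{M}(\mathcal{I}_{1:V})} \sum_{j=1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} \right]. \tag{9}$$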

Of course one could use $\mathcal{M}(\mathcal{I}_{1:V}) = \mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$. However, in many prediction problems of interest, solving
the optimization problem (i.e., computing $R_t(o)$) for this class might be computationally hard. Hence, for
computational efficiency we shall use a superset of $\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$. We pay for computational efficiency by
having a worse regret bound given by the Rademacher complexity over the larger set $\mathcal{M}(\mathcal{I}_{1:V})$ rather than
$\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$. We investigate this topic in the next two sections.
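
The price of enlarging the set can be seen numerically on a toy instance. The sketch below gives a Monte Carlo estimate of the supremum in (9) for a finite set of labelings, using the fact that for an indicator matrix $M^g$ the double sum reduces to $\sum_j \epsilon_{j,g_j}$. The function name, the path-graph smoothness constraint, and the sample size are assumptions; brute-force enumeration is used only because the instance is tiny.

```python
import random
from itertools import product


def rademacher_estimate(labelings, kappa, num_draws, rng):
    """Monte Carlo estimate of E sup_g sum_j eps[j][g_j] over a finite set."""
    num_items = len(labelings[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [[rng.choice((-1, 1)) for _ in range(kappa)]
               for _ in range(num_items)]
        # For an indicator matrix M^g: sum_{j,k} eps[j][k] * M[j][k] = sum_j eps[j][g_j].
        total += max(sum(eps[j][g[j] - 1] for j in range(num_items))
                     for g in labelings)
    return total / num_draws


all_labelings = list(product((1, 2), repeat=4))
# Labelings of a 4-node path with at most one cut edge (a "pruned" class).
smooth = [g for g in all_labelings
          if sum(g[i] != g[i + 1] for i in range(3)) <= 1]
small = rademacher_estimate(smooth, 2, 200, random.Random(0))
large = rademacher_estimate(all_labelings, 2, 200, random.Random(0))
```

Both estimates use the same seed and hence the same Rademacher draws, so the estimate for the superset is never smaller than the one for the pruned class.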


5 Prediction Based on Lasserre SDP Hierarchy 

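In the previous section we provided a randomized prediction strategy based on any class of matrices $\mathcal{M}(\mathcal{I}_{1:V})$
that is a superset of $\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$. In this section we will employ semidefinite programming and Lasserre
hierarchies to solve for the values $R_t(o)$, defined in (7).

Let us begin with $\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$ and relax the problem. By the definition of $\mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}$ we can write down
the optimization problem for each $o \in [\kappa]$ as

$$\begin{aligned}
&\max_{M \in \mathcal{M}_{\mathcal{F}_K[\mathcal{I}_{1:V}]}} \left\{ 2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + M_{t,o} + \sum_{s=1}^{t-1} M_{s,y_s} \right\} \\
&\quad = \max \left\{ 2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + M_{t,o} + \sum_{s=1}^{t-1} M_{s,y_s} \right\} \quad \text{s.t.} \quad \sum_{c \in \cup\mathcal{C}_t} c(M) \le K, \quad M \in \mathcal{F}^{x_{1:V}}
\end{aligned}$$

where

$$\mathcal{F}^{x_{1:V}} = \big\{ M \in \{0,1\}^{V \times \kappa} : M_{t,i} = \mathbf{1}\{f(x_t) = i\},\ f \in \mathcal{F},\ t \in [V],\ i \in [\kappa] \big\}.$$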

We shall assume throughout this section that for any $x_{1:V}$, the set $\mathcal{F}^{x_{1:V}}$ can be represented as
$\{0,1\}^{V \times \kappa} \cap \mathcal{P}^{x_{1:V}}$ where $\mathcal{P}^{x_{1:V}} \subseteq \mathbb{R}^{V \times \kappa}$ can be represented by linear constraints efficiently. The superscript with side
information is to remind us that the constraints can depend on the side information presented. To best
match semidefinite formulations found in the literature, we assume

$$\mathcal{P}^{x_{1:V}} = \big\{ M \in \mathbb{R}^{V \times \kappa} : \forall j \in [d],\ M^\top B_j \le c_j \big\},$$

an intersection of $d$ linear constraints. (Henceforth, whenever we refer to a matrix $M$ as a vector, we mean
the vectorized form.) The reason for the assumption is that we would like to apply the Lasserre hierarchy to
represent $\{0,1\}^{V \times \kappa} \cap \mathcal{P}^{x_{1:V}}$. As an example, for the case of all possible static experts, we are interested in
predicting as well as any labeling that violates at most $K$ constraints and, hence, $\mathcal{P}^{x_{1:V}}$ is simply $[0,1]^{V \times \kappa}$.

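Given $y_{1:t-1}$, $o \in [\kappa]$, and a draw of $\epsilon_{t+1:V}$, we define the $V \times \kappa$ dimensional vector $Y^t(o)$ as

$$Y^t_{s,j}(o) = \begin{cases}
\mathbf{1}\{j = y_s\} & s < t,\ j \in [\kappa] \\
2\epsilon_{s,j} & s > t,\ j \in [\kappa] \\
\mathbf{1}\{j = o\} & s = t,\ j \in [\kappa]
\end{cases}$$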

With this notation, we write the linear objective as

$$2\sum_{j=t+1}^{V}\sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} + M_{t,o} + \sum_{s=1}^{t-1} M_{s,y_s} = M^\top Y^t(o).$$
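
The identity between the objective in (7) and the inner product with $Y^t(o)$ can be checked numerically. The sketch below uses 1-indexed rounds and labels stored in 0-indexed Python lists; the function names and the small test instance are hypothetical.

```python
def build_Y(t, o, past_labels, eps, num_items, kappa):
    """Construct Y^t(o): indicators of past labels for s < t, the indicator of
    o at s = t, and doubled Rademacher signs for s > t (rounds are 1-indexed)."""
    Y = [[0.0] * kappa for _ in range(num_items)]
    for s in range(1, num_items + 1):
        for j in range(1, kappa + 1):
            if s < t:
                Y[s - 1][j - 1] = 1.0 if j == past_labels[s - 1] else 0.0
            elif s == t:
                Y[s - 1][j - 1] = 1.0 if j == o else 0.0
            else:
                Y[s - 1][j - 1] = 2.0 * eps[s - 1][j - 1]
    return Y


def inner(M, Y):
    """Inner product of two matrices viewed as vectors."""
    return sum(m * y for rm, ry in zip(M, Y) for m, y in zip(rm, ry))


def objective(M, t, o, past_labels, eps):
    """The three-term objective inside the supremum in (7)."""
    num_items, kappa = len(M), len(M[0])
    future = 2 * sum(eps[j][k] * M[j][k]
                     for j in range(t, num_items) for k in range(kappa))
    past = sum(M[s][past_labels[s] - 1] for s in range(t - 1))
    return future + M[t - 1][o - 1] + past


eps = [[1, -1], [1, 1], [-1, 1], [-1, -1]]   # only rows after round t are used
M = [[1, 0], [0, 1], [0, 1], [1, 0]]         # indicator matrix of labeling (1, 2, 2, 1)
Y = build_Y(t=3, o=2, past_labels=[1, 2], eps=eps, num_items=4, kappa=2)
```

Collapsing the objective into a single vector $Y^t(o)$ is what allows the same solver to be called for every round $t$ and every candidate label $o$.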


We are now ready to write down the SDP relaxation that we shall solve for every round $t$ and every $o \in [\kappa]$
(these are the $R_t(o)$'s from (7)). The optimization problem is based on the $r$th level of the Lasserre SDP
relaxation, written in the vector form as follows. First, we introduce a vector $U_{(S,\alpha)}$ for every $S \subseteq [V]$ with
$|S| \le r$ and every $\alpha \in [\kappa]^{S}$. The optimization problem is now written as

$$\begin{aligned}
\mathrm{SDP}^{r}_{1st}(Y, K) = \max\ & \lambda && && (10) \\
\text{s.t.}\ & \sum_{c \in \cup\mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha)\, \|U_{(S_c,\alpha)}\|^2 \le K && && (11) \\
& \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle = 0 && \forall\, \alpha_1(S_1 \cap S_2) \ne \alpha_2(S_1 \cap S_2) \\
& \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle = \langle U_{(S_3,\alpha_3)}, U_{(S_4,\alpha_4)} \rangle && \forall\, S_1 \cup S_2 = S_3 \cup S_4,\ \alpha_1 \circ \alpha_2 = \alpha_3 \circ \alpha_4 \\
& \sum_{k=1}^{\kappa} \|U_{(\{i\},k)}\|^2 = 1, \quad \|U_{(\emptyset,\emptyset)}\|^2 = 1 && \forall\, i \in [V] \\
& \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle \ge 0 && \forall\, S_1, S_2, \alpha_1, \alpha_2 \\
& \sum_{v \in V,\ \beta \in [\kappa]} \|U_{(S \cup \{v\},\, \alpha \circ \beta)}\|^2\, B_j(v,\beta) \le c_j\, \|U_{(S,\alpha)}\|^2 && \forall\, S, \alpha,\ j \in [d] \\
& \sum_{v \in V,\ \beta \in [\kappa]} \|U_{(S \cup \{v\},\, \alpha \circ \beta)}\|^2\, Y_{(v,\beta)} \ge \lambda\, \|U_{(S,\alpha)}\|^2 && \forall\, S, \alpha && (12)
\end{aligned}$$

where in the above $R_c : [\kappa]^{S_c} \to \mathbb{R}_{\ge 0}$ is the constraint violation mapping corresponding to constraint $c$. The first
constraint in the above program is the requirement that cumulative constraint violation does not exceed $K$.
The rest of the constraints are standard (the notation $\alpha_1 \circ \alpha_2$ denotes the concatenated assignment of labels
whenever the assignments don’t have a mismatch on the common entries). The above formulation is similar
to the formulation for CSP’s using Lasserre hierarchy, and we refer to [Tul09, RS09, GS13, Sch08] for a more
detailed treatment of the semidefinite relaxation technique.


In the above optimization problem, maximizing over $\lambda$ can be performed efficiently as follows. First, for
a given $\lambda$, we assume that we can solve the following optimization problem:

$$\mathrm{SDP}^{r}_{2nd}(Y, \lambda) = \min \sum_{c \in \cup\mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha)\, \|U_{(S_c,\alpha)}\|^2 \tag{13}$$

under the constraints of $\mathrm{SDP}^{r}_{1st}$ excluding constraint (11). (14)

To find the solution to the maximization problem in (10) we simply perform a binary search over $\lambda$ to find
the largest $\lambda$ for which the value of the solution of (13) is at most $K$.
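
The binary search can be sketched independently of any SDP solver. In the Python sketch below, `min_violation` is a hypothetical oracle standing in for the value of (13) as a function of $\lambda$; the search relies on this value being nondecreasing in $\lambda$, since raising $\lambda$ only shrinks the feasible set. The bracket `[lo, hi]` and the tolerance are assumptions of this illustration.

```python
def largest_feasible_lambda(min_violation, K, lo, hi, tol=1e-6):
    """Largest lambda in [lo, hi] (up to tol) with min_violation(lambda) <= K.

    Assumes min_violation is nondecreasing in lambda."""
    if min_violation(lo) > K:
        raise ValueError("even the smallest lambda exceeds the budget K")
    if min_violation(hi) <= K:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if min_violation(mid) <= K:
            lo = mid   # mid is feasible: move the lower end up
        else:
            hi = mid   # mid is infeasible: move the upper end down
    return lo


# Toy monotone oracle (an assumption, not an SDP): violation grows linearly.
toy_oracle = lambda lam: max(0.0, lam)
best = largest_feasible_lambda(toy_oracle, K=2.5, lo=-10.0, hi=10.0)
```

Each iteration costs one call to the oracle, so the number of SDP solves per value $R_t(o)$ grows only logarithmically in the desired precision.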

On each round $t \in [V]$ and for each $o \in [\kappa]$, we find the value of $\mathrm{SDP}^{r}_{1st}(Y^t(o), K)$. This gives $R_t(o)$
in (7), and, consequently, the randomized prediction obtained from (8). One can think of the solution in
$\mathcal{M}(\mathcal{I}_{1:V})$ as the projected solution from the $r$th level Lasserre hierarchy SDP. Specifically, think of $\mathcal{M}(\mathcal{I}_{1:V})$
as being described by the set of vectors $U$ that satisfy the constraints of the SDP and $M_{j,k}$ as $\|U_{(\{j\},k)}\|^2$. It
is important to note that for any constant level $r$, we obtain a poly-time algorithm. In the next section, we
shall provide an analysis of the bound on the expected regret of this randomized strategy using the generic
upper bound from (9).


6 Regret Bounds Based on Existence of Rounding Strategies 

Let us define solutions to two other optimization problems in addition to the solution to $\mathrm{SDP}^{1\mathrm{st}}_r(Y, K)$.
These programs are defined for the purposes of analysis only, and will serve as a step toward upper bounding the
Rademacher complexity of the relaxed set. To this end, define:


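$$\mathrm{OPT}^{2\mathrm{nd}}(Y, \Delta) = \min \sum_{c \in \cup_t \mathcal{C}_t} c(M) \quad \text{subject to} \quad Y^\top M \ge \Delta, \;\; M \in \mathcal{F}_{x_{1:V}} \qquad (15)$$

$$\mathrm{OPT}^{1\mathrm{st}}(Y, K) = \max M^\top Y \quad \text{subject to} \quad \sum_{c \in \cup_t \mathcal{C}_t} c(M) \le K, \;\; M \in \mathcal{F}_{x_{1:V}} \qquad (16)$$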


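Definition 1. Given $I_{1:V} = (\mathcal{C}, x)_{1:V}$, we define the gap between the Lasserre SDP solution at level $r$ in
(13) and the optimization problem in (15) as

$$\mathrm{gap}(r; I_{1:V}) = \sup_{\epsilon \in \{-1,1\}^{V \times \kappa},\; \Delta \in [-V, V]} \frac{\mathrm{OPT}^{2\mathrm{nd}}(\epsilon, \Delta)}{\mathrm{SDP}^{2\mathrm{nd}}_r(\epsilon, \Delta)}.$$

Whenever the context of $\mathcal{C}_{1:V}, x_{1:V}$ is clear we will simply use $\mathrm{gap}(r)$.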

The following theorem bounds the expected regret of the proposed randomized strategy in terms of the gap.
Observe that, compared to the original class, the regret bound only gains a multiplicative factor $\mathrm{gap}(r)$ in the
constraint budget $K$: the bound is stated in terms of the Rademacher complexity of the original class with its
violation budget $K$ enlarged. For notational convenience, given a sequence $(\mathcal{C}, x)_{1:V}$ and any $K > 0$, let


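$$\mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]) := \mathbb{E}_{\epsilon_{1:V}} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{j=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k}\, \mathbf{1}\{f(x_j) = k\} \right].$$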
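To make the quantity above concrete, here is a minimal Monte Carlo sketch for a small, explicitly enumerated class of labelings. It is an illustration only: the function name, the 0-based label convention, and the brute-force enumeration are assumptions made for this example, and the approach is feasible only for tiny classes.

```python
import itertools

import numpy as np


def rademacher_complexity(labelings, num_labels, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_f sum_j sum_k eps[j, k] * 1{f(x_j) = k}.

    labelings: array-like of shape (num_functions, V) with entries in
               {0, ..., num_labels - 1} (0-based labels, an assumption here).
    """
    rng = np.random.default_rng(seed)
    labelings = np.asarray(labelings)
    _, V = labelings.shape
    items = np.arange(V)
    total = 0.0
    for _ in range(num_draws):
        eps = rng.choice([-1.0, 1.0], size=(V, num_labels))
        # For each f, the double sum collapses to sum_j eps[j, f(x_j)].
        values = eps[items, labelings].sum(axis=1)
        total += values.max()
    return total / num_draws


# Unconstrained class on 3 items with 2 labels: the supremum decouples over
# items, and E max(eps_1, eps_2) = 1/2, so the exact value is 3 * 1/2 = 1.5.
all_labelings = list(itertools.product(range(2), repeat=3))
estimate = rademacher_complexity(all_labelings, num_labels=2, num_draws=4000, seed=1)
```

Restricting the class (for example, to labelings with a small violation budget) can only decrease the estimate, which is the effect the regret bounds exploit.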


The following theorem is a performance guarantee for the proposed prediction strategy. 

Theorem 3. If we use the $r$th level Lasserre hierarchy and the randomized strategy obtained from the
solutions via (8), the expected regret of the forecaster is bounded as

$$\mathbb{E}[\mathrm{Reg}] \le 2\, \mathbb{E}_{I_{1:V}}\, \mathrm{Rad}_V(\mathcal{F}_{\mathrm{gap}(r) \cdot K}[I_{1:V}]).$$

Proof. From the bound in (9), the expected regret of our algorithm is bounded as

$$\mathbb{E}[\mathrm{Reg}] \le 2\, \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon_{1:V}} \left[ \sup_{M \in \mathcal{M}(x_{1:V})} \sum_{j=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k} M_{j,k} \right].$$
Let $M \in \mathcal{M}(x_{1:V})$ be the projected solution from the $r$th level Lasserre hierarchy SDP in the maximization
problem in (10). Then for each draw of $\epsilon_{1:V}$, the supremum in the Rademacher complexity term can
be replaced by the value of the optimization problem in (10), given by $\mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)$. This is because we can
think of $M_{j,k}$ as corresponding to $\|U_{(\{j\},k)}\|^2$, where the vectors $U$ satisfy the constraints of the SDP. On the
other hand, for a given draw of $\epsilon_{1:V}$, the value of

$$\sup_{f \in \mathcal{F}_{\mathrm{gap}(r) \cdot K}[I_{1:V}]} \sum_{j=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k}\, \mathbf{1}\{f(x_j) = k\}$$

is exactly $\mathrm{OPT}^{1\mathrm{st}}(\epsilon, \mathrm{gap}(r) \cdot K)$. Hence, to prove our bound, it suffices to show that for any
problem at hand,

$$\mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K) \le \mathrm{OPT}^{1\mathrm{st}}(\epsilon, \mathrm{gap}(r) \cdot K).$$

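To do so, we go through the problems in Eqns. (13) and (15) and arrive at $\mathrm{OPT}^{1\mathrm{st}}(\epsilon, \mathrm{gap}(r) \cdot K)$. Observe
that the solution to the optimization problem in (10) has value $\mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)$ and violates
constraints by at most $K$. Using this feasible solution in (13), we conclude that

$$\mathrm{SDP}^{2\mathrm{nd}}_r(\epsilon, \mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)) \le K.$$

However, by the definition of $\mathrm{gap}(r)$ we can conclude that

$$\mathrm{OPT}^{2\mathrm{nd}}(\epsilon, \mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)) \le \mathrm{gap}(r) \cdot \mathrm{SDP}^{2\mathrm{nd}}_r(\epsilon, \mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)) \le \mathrm{gap}(r) \cdot K.$$

By the definition of $\mathrm{OPT}^{2\mathrm{nd}}$, this means that the solution $M \in \mathcal{F}_{x_{1:V}}$ to that optimization problem satisfies
$\sum_c c(M) \le \mathrm{gap}(r) \cdot K$ and simultaneously, since we are considering $\mathrm{OPT}^{2\mathrm{nd}}$ with second argument
$\mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)$, $M^\top Y \ge \mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K)$ with $Y = \epsilon$. Thus, using this solution in the optimization problem in Eq.
(16) with second argument $\mathrm{gap}(r) \cdot K$, we conclude:

$$\mathrm{SDP}^{1\mathrm{st}}_r(\epsilon, K) \le \mathrm{OPT}^{1\mathrm{st}}(\epsilon, \mathrm{gap}(r) \cdot K)$$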


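as required. Since this is true for every $\epsilon$, we have that

$$\mathbb{E}[\mathrm{Reg}] \le 2\, \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon_{1:V}} \left[ \sup_{f \in \mathcal{F}_{\mathrm{gap}(r) \cdot K}[I_{1:V}]} \sum_{j=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k}\, \mathbf{1}\{f(x_j) = k\} \right].$$

□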

A few remarks are in order. First, since in the above $\mathrm{gap}(r)$ really refers to $\mathrm{gap}(r; \mathcal{C}_{1:V}, x_{1:V})$ for
$\mathcal{C}_{1:V}, x_{1:V}$ drawn from the known generation process, bounds can often be improved: the behavior is governed
by the average-case gap rather than the worst-case gap.

Second, we would like to stress that while the bounds in this section are provided in terms of integrality 
gaps, for the actual prediction algorithm we never require a rounding strategy. We only need existence of a 
rounding strategy with some integrality gap to provide bounds on the expected regret in terms of Rademacher 
complexity of the original class. 

Third, as already mentioned in the introduction, the approximation factor multiplies the regret bound
rather than the cumulative loss of the benchmark predictor. That is, regret is still with respect to $1 \times \mathrm{OPT}$. As
long as the integrality gap is not too large for $r = O(1)$ levels of the Lasserre hierarchy, we obtain polynomial-time
algorithms even when the problem of finding the optimal benchmark predictor given all the instances and
constraints might be computationally hard. This is due to the improper nature of the prediction algorithm.

The Lasserre hierarchy is known to be more powerful than the Sherali-Adams and Lovász-Schrijver
hierarchies. This means that if we use some $r \in \mathbb{N}$ for our prediction strategy, then the $\mathrm{gap}(r)$ we obtain is
no larger than the approximation guarantees provided by algorithms using the Sherali-Adams or Lovász-Schrijver
hierarchies at around the same $r$. Also, clearly $\mathrm{gap}(r) \le \mathrm{gap}(r')$ for any $r' < r$. Thus we can use approximation
guarantees proved for the same problems based on algorithms that use LP hierarchies at level $r$ or smaller.
In summary, a result about an integrality gap for any weaker relaxation has an immediate implication for the
regret bound, without affecting the algorithm we use.

So far, we considered the problem where the benchmark was minimizing the number of violated
constraints. Alternatively, one could think of $\mathcal{F}$ being restricted across items by requiring that at least $K$
constraints be satisfied. Much of the machinery presented here, including the application of rounding
results to obtain bounds on the expected regret, can easily be extended to such problems (which consist of
typical CSP-type problems). In these cases the SDP optimization problems we solve on every step would
be replaced by maximization versions of the SDP relaxations with the appropriate level of the Lasserre hierarchy.

7 Penalized Version of Relaxation 

In this section we consider a penalized version of the relaxation, putting the “$\le K$” constraint into the
objective. We use the Lasserre hierarchy to solve the penalized version of the optimization problem. Let us
write down the SDP corresponding to the $r$th level of the Lasserre hierarchy. To this end, we introduce a vector
$U_{(S,\alpha)}$ for every $S \subseteq [V]$ with $|S| \le r$ and every $\alpha \in [\kappa]^S$. The optimization problem is written as


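$$\mathrm{SDP}_\lambda(Y, \lambda) = \min \left\{ \lambda \sum_{c \in \cup_t \mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha)\, \|U_{(S_c,\alpha)}\|^2 - \sum_{v \in V,\, \beta \in [\kappa]} \|U_{(\{v\},\beta)}\|^2\, Y(v, \beta) \right\} \qquad (17)$$

$$\begin{aligned}
\text{s.t.} \quad & \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle = 0 && \forall\, \alpha_1(S_1 \cap S_2) \ne \alpha_2(S_1 \cap S_2) \\
& \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle = \langle U_{(S_3,\alpha_3)}, U_{(S_4,\alpha_4)} \rangle && \forall\, S_1 \cup S_2 = S_3 \cup S_4,\; \alpha_1 \circ \alpha_2 = \alpha_3 \circ \alpha_4 \\
& \sum_{k=1}^{\kappa} \|U_{(\{i\},k)}\|^2 = 1, \quad \|U_{(\emptyset,\emptyset)}\|^2 = 1 && \forall\, i \in [V] \\
& \langle U_{(S_1,\alpha_1)}, U_{(S_2,\alpha_2)} \rangle \ge 0 && \forall\, S_1, S_2, \alpha_1, \alpha_2 \\
& \sum_{\beta \in [\kappa]} \|U_{(S \cup \{j\},\, \alpha \circ \beta)}\|^2 = \|U_{(S,\alpha)}\|^2 && \forall\, S, \alpha, j \in [d]
\end{aligned}$$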

This SDP should be compared to $\mathrm{SDP}^{1\mathrm{st}}_r$. Notice that the constraint (11) now appears in the objective.
We now prove a “penalized version” of Lemma 2. We will also provide an appropriate relaxation from which
an efficient prediction strategy follows.

Let us define a slightly modified version of the gap between the SDP solution and the integral solution to the
penalized optimization problem as follows. Define the optimization problem

$$\mathrm{OPT}_\lambda(Y, \lambda) = \min \; \lambda \sum_{c \in \cup_t \mathcal{C}_t} c(M) - Y^\top M \quad \text{s.t.} \quad M \in \mathcal{F}_{x_{1:V}} \qquad (18)$$

Definition 2. Given $(\mathcal{C}_{1:V}, x_{1:V})$, we define the gap between the Lasserre SDP solution at level $r$ in (17)
and the optimization problem in (18) as

$$\mathrm{gap}(r; \mathcal{C}_{1:V}, x_{1:V}) := \min \left\{ a : \forall \epsilon \in \{-1, 1\}^{V \times \kappa}, \;\; \mathrm{SDP}_\lambda(\epsilon, \lambda) \ge \mathrm{OPT}_\lambda(\epsilon, \lambda / a) \right\}.$$

Whenever the context of $\mathcal{C}_{1:V}, x_{1:V}$ is clear we will simply write $\mathrm{gap}(r)$.

That is, $\mathrm{gap}(r)$ is the factor by which we scale down only the constraint costs, but not the linear part.

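Lemma 4. Given $(\mathcal{C}, x)_{1:V}$, let $\mathcal{G} = \mathcal{F}_K[I_{1:V}]$, and fix any $\lambda > 0$. Let $\mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})$ denote the set of vectors
$U$ corresponding to the $r$th level Lasserre hierarchy, that is, vectors satisfying the constraints of the SDP
in Eq. (17). The following relaxation is admissible for prediction with respect to $\mathcal{G}$:

$$\mathrm{Rel}(\mathcal{G} \mid y_{1:t}) = \mathbb{E}_{\epsilon_{t+1:V}} \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{j=t+1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k} \|U_{(\{j\},k)}\|^2 + \sum_{s=1}^{t} \|U_{(\{s\},y_s)}\|^2 - \lambda \sum_{c \in \mathcal{C}_{1:V}} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\} - t + \lambda K$$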


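Further, the randomized strategy corresponding to the above relaxation is given by first drawing Rademacher
vectors $\epsilon_{t+1:V}$ and then predicting $\hat{y}_t$ according to

$$q_t = \arg\min_{q \in \Delta([\kappa])} \max_{y_t \in [\kappa]} \left\{ \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{j=t+1}^{V} \sum_{k=1}^{\kappa} \epsilon_{j,k} \|U_{(\{j\},k)}\|^2 + \sum_{s=1}^{t} \|U_{(\{s\},y_s)}\|^2 - \lambda \sum_{c \in \mathcal{C}_{1:V}} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\} - q[y_t] \right\}$$

As before, each $\epsilon_j$ is a vector of independent Rademacher random variables and $\epsilon_{j,k}$ stands for the $k$th
coordinate of this vector.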
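Once the inner SDP values have been computed for each candidate label, the remaining step of the strategy is a small min-max problem over the simplex. The sketch below assumes the strategy is read as choosing $q$ to minimize $\max_{y} \{R[y] - q[y]\}$, where $R[y]$ is the already computed value for label $y$ (this reading follows the admissibility argument in the appendix; the SDP solves themselves are not shown). The function name and the sort-based thresholding are choices made for this example.

```python
import numpy as np


def minimax_distribution(scores):
    """Return (q, tau) with q in the simplex minimizing max_y (scores[y] - q[y]).

    The minimizer has the water-filling form q[y] = max(scores[y] - tau, 0),
    with tau chosen so that q sums to one; tau is then the min-max value.
    """
    r = np.asarray(scores, dtype=float)
    u = np.sort(r)[::-1]                 # scores in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(r) + 1)
    taus = (css - 1.0) / ks              # candidate thresholds for support size k
    k = ks[u > taus][-1]                 # largest support size that stays consistent
    tau = taus[k - 1]
    q = np.maximum(r - tau, 0.0)
    return q, tau


q, tau = minimax_distribution([1.0, 0.6, 0.0])
```

Labels whose value falls below the threshold receive zero probability, so the forecaster randomizes only among the labels that could otherwise make the relaxation large.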

Let us bound the regret of the algorithm. To this end, assume we have a bound on the gap for the 
penalized SDP. 


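Theorem 5. Suppose that for any $c > 1$,

$$\mathrm{Rad}_V(\mathcal{F}_{cK}[I_{1:V}]) \le c^p\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])$$

for some $p < 1$. With the notation of Lemma 4, if we choose

$$\lambda^* = \sup \left\{ \lambda : \lambda K \le \mathbb{E}_{\epsilon_{1:V}} \left[ \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{t,k} \|U_{(\{t\},k)}\|^2 - \lambda \sum_{c \in \cup_t \mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\} \right] \right\} \qquad (19)$$

the final relaxation is upper bounded by

$$\mathrm{Rel}(\mathcal{G} \mid \emptyset) \le 4\, \mathrm{gap}(r)\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]).$$

In view of Lemma 1, the expected regret of the strategy described in (6) is upper bounded as

$$\mathbb{E}[\mathrm{Reg}] \le 4\, \mathrm{gap}(r)\, \mathbb{E}_{I_{1:V}}\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]).$$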

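To estimate $\lambda^*$ in (19), we use the concentration property of Rademacher complexity. We sample
Rademacher random variables, constraints, and side information. Next, we optimize over the Lasserre SDP
at level $r$ multiple times to find the maximal $\lambda$ that satisfies the inequality

$$\lambda K \le \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{i=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{i,k} \|U_{(\{i\},k)}\|^2 - \lambda \sum_{c \in \cup_t \mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\}.$$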
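The estimation procedure can be sketched as a bisection on a sample average. In this illustration the Lasserre feasible set is replaced by a small, explicitly enumerated set of candidate matrices with known costs, purely so the supremum can be computed by brute force; that substitution, the function names, and the toy instance are all assumptions of the example. The bisection relies on the sample average of the supremum minus $\lambda K$ being decreasing in $\lambda$, which holds when costs are nonnegative and $K > 0$.

```python
import itertools

import numpy as np


def sample_linear_terms(candidates, num_draws=500, seed=0):
    """Return a (num_draws, n) array with entries 2 * <eps_d, M_i>."""
    rng = np.random.default_rng(seed)
    M = np.asarray(candidates, dtype=float)          # shape (n, V, kappa)
    eps = rng.choice([-1.0, 1.0], size=(num_draws,) + M.shape[1:])
    return 2.0 * np.einsum("dvk,nvk->dn", eps, M)


def empirical_slack(lam, lin, costs, K):
    """Sample average of sup_i {lin_i - lam * cost_i}, minus lam * K."""
    return (lin - lam * costs).max(axis=1).mean() - lam * K


def estimate_lambda_star(lin, costs, K, tol=1e-8):
    """Largest lam (up to tol) whose empirical slack is still nonnegative."""
    costs = np.asarray(costs, dtype=float)
    if empirical_slack(0.0, lin, costs, K) < 0:
        return 0.0
    lo, hi = 0.0, 1.0
    while empirical_slack(hi, lin, costs, K) >= 0:
        hi *= 2.0                                    # grow until infeasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if empirical_slack(mid, lin, costs, K) >= 0:
            lo = mid
        else:
            hi = mid
    return lo


# Toy stand-in: all 2-labelings of a 3-node path; cost = number of cut edges.
labelings = list(itertools.product(range(2), repeat=3))
candidates = np.zeros((len(labelings), 3, 2))
for i, f in enumerate(labelings):
    for v, k in enumerate(f):
        candidates[i, v, k] = 1.0
costs = np.array([float((f[0] != f[1]) + (f[1] != f[2])) for f in labelings])
lin = sample_linear_terms(candidates)
lam_star = estimate_lambda_star(lin, costs, K=1.0)
```

In the actual procedure each evaluation of the supremum would be an SDP solve at level $r$, so the bisection keeps the number of solves logarithmic in the target precision.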



8 Examples 

We illustrate the results of this paper on two examples. The first uses the SDP formulation in Section 5,
while the second example uses the penalized version of Section 7.

Binary Classification of Nodes with Cut Constraints. Let us consider a weighted version of the
problem discussed in the introduction. Suppose we are given a weighted graph $G = (V, E, W)$ where
$W : E \to [-1, 1]$. Let us consider the case where $x_t = t$ and there is no other side information. The benchmark
class of predictors consists of all binary labelings of the nodes from the set $\mathcal{F}_K = \{f \in [2]^V : f^\top L f \le K\}$, where
$L$ is the graph Laplacian. This problem can be formulated easily in the generic form specified in this paper by
adding one constraint $c$ per edge $(u, v) \in E$ with $S_c = \{u, v\}$. The cost of the constraint violation is given
by

$$R_c(\alpha) = 1 - W(e_{u,v})\,\bigl(2 \cdot \mathbf{1}\{\alpha(u) = \alpha(v)\} - 1\bigr).$$
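As a small illustration of the per-edge cost above, the following sketch evaluates it for a single edge and sums it over a labeled graph. The edge-list representation and function names are assumptions made for this example.

```python
def edge_violation(weight, label_u, label_v):
    """Cost 1 - W * (2 * 1{labels agree} - 1) for one edge."""
    agree = 1 if label_u == label_v else 0
    return 1.0 - weight * (2 * agree - 1)


def total_violation(edges, labels):
    """Sum of edge costs; edges is a list of (u, v, weight) triples."""
    return sum(edge_violation(w, labels[u], labels[v]) for (u, v, w) in edges)


# A triangle with one negative edge; labels are taken from {1, 2}.
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, -1.0)]
cost = total_violation(triangle, {0: 1, 1: 1, 2: 2})
```

With weight $+1$ an agreeing pair costs 0 and a disagreeing pair costs 2, while weight $-1$ reverses the roles, so negative weights encode a preference for different labels.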

These constraints can in fact be rewritten as quadratic constraints, and the Lasserre SDP at level $r$ for the
$\mathrm{SDP}^{2\mathrm{nd}}_r$ problem in Eq. (13) is in fact the $r$th level SDP relaxation of the quadratic integer program
with a single linear constraint given by the labels (the one corresponding to (12) in $\mathrm{SDP}^{1\mathrm{st}}_r$).

It is shown in [GS13] that the value of a rounded solution with $O(r)$ levels of the Lasserre hierarchy is
no more than $2/\lambda_r(L)$ times OPT. Furthermore, the rounding is faithful, and hence concentration bounds
hold for linear constraints [GS13, Thm 6.1]. Since the linear constraints are given by Rademacher random
variables, standard concentration results tell us that de-randomization does not violate the constraints by
more than $O(\sqrt{V})$. By tracing through the proof of Theorem 3, one can see that this extra $O(\sqrt{V})$ factor
comes out additively in the final bound on Rademacher complexity. Since this factor is of smaller order than
the Rademacher complexity itself, the bound is not affected. We conclude that

$$\mathbb{E}[\mathrm{Reg}] \le O\left( \mathbb{E}_{(\mathcal{C},x)_{1:V}}\, \mathrm{Rad}_V\left( \mathcal{F}_{2K/\min\{1,\lambda_r\}}[I_{1:V}] \right) \right)$$

where $\lambda_r$ is the $r$th smallest eigenvalue of the normalized Laplacian of the graph, and the algorithm runs in
time $n^{O(r)}$. If the graph generation process is well behaved in terms of spectral values of the Laplacian, as
in a preferential attachment model for the graph, then the bound we obtain is near optimal. As a crude
upper bound on the Rademacher complexity above, one can use

$$\sqrt{K V \max\{1, \lambda_r^{-1}\} \log V}.$$

Beyond the binary prediction considered above, one can also analyze the problem of predicting one of $[\kappa]$
labels for each node of a graph. As an interesting set of constraints, one can consider Unique-Games-type
constraints for labelings of edges in the graph. As a benchmark, we compare our cumulative loss to
the cumulative loss of the labelings that violate at most $K$ of the labeling constraints on edges. As in
the previous example, this problem can also be written with a quadratic form for the constraints. The integrality
gap from [GS13] yields a bound on regret in terms of the Rademacher complexity of the original class where the
constraint $K$ is enlarged by a factor of order $\max\{1, \lambda_r^{-1}\}$. Here, the de-randomization procedure incurs an
additional $O(\sqrt{\kappa V})$ violation of the constraints, which, again, does not affect the final bound on Rademacher
complexity.

Online Prediction with Metric Labeling Constraints. In the metric labeling problem [KT02], one
aims to assign one of $\kappa$ labels to each of the $V$ items, minimizing a combinatorial objective function consisting
of two parts: assignment costs per item and separation costs based on pairs of items. This model subsumes
MAP estimation in a Markov random field model.

More precisely, let $G = (V, E, W)$ be a weighted graph with $W : E \to [0, 1]$. The cost of an assignment
$g \in [\kappa]^V$ is written as

$$\sum_{v \in [V]} d_1(v, g_v) + \sum_{(u,v) \in E} W(u, v)\, d_2(g_u, g_v) \qquad (20)$$

where $d_2 : [\kappa] \times [\kappa] \to \mathbb{R}_{\ge 0}$ is a metric on the space of labels and $d_1 : [V] \times [\kappa] \to \mathbb{R}_{\ge 0}$ is the cost of assigning a
particular label to a node. The distance $d_2$ is multiplied
by the edge weight, so that “similar” items (high edge weight) pay more for disagreeing labels.

To map this setting into our notation, we define two types of constraints. The first type of constraint $c$ is
associated with a singleton set $S_c = \{v\}$ and cost $R_c(g) = d_1(v, g_v)$, for $g \in [\kappa]^V$. The second type corresponds
to separation costs, and we define it through $S_c = \{u, v\}$ and $R_c(g) = W(u, v)\, d_2(g_u, g_v)$ if $(u, v)$ is an edge,
and 0 otherwise.
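The metric labeling cost in (20), split into the two constraint types just described, can be evaluated directly for a given assignment. The sketch below uses 0-based node and label indices and plain nested lists; these representation choices and the function name are assumptions of the example.

```python
def metric_labeling_cost(assign_cost, edges, label_metric, labeling):
    """Evaluate the objective (20) for one assignment.

    assign_cost[v][k]  : d_1(v, k), cost of giving label k to node v
    edges              : list of (u, v, weight) triples, weight in [0, 1]
    label_metric[a][b] : d_2(a, b), a metric on labels
    labeling[v]        : label assigned to node v
    """
    # Singleton constraints: one assignment cost per node.
    node_part = sum(assign_cost[v][labeling[v]] for v in range(len(labeling)))
    # Pairwise constraints: weighted label distance per edge.
    edge_part = sum(w * label_metric[labeling[u]][labeling[v]]
                    for (u, v, w) in edges)
    return node_part + edge_part


assign_cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
edges = [(0, 1, 1.0), (1, 2, 0.5)]
label_metric = [[0.0, 1.0], [1.0, 0.0]]
cost_a = metric_labeling_cost(assign_cost, edges, label_metric, [0, 1, 1])
cost_b = metric_labeling_cost(assign_cost, edges, label_metric, [0, 0, 0])
```

In this toy instance the two assignments trade assignment cost against separation cost and end up with the same total, which shows how the two parts of the objective interact.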

To exhibit a polynomial-time method with a provable regret bound, we turn to the penalized version of the
SDP, developed in Section 7. We observe that both [KT02] and [CKNZ04] study linear relaxations of the
integer program and prove integrality gaps which are based on the separation costs. Specifically, [CKNZ04]
use a simple LP relaxation for the problem, and since the Lasserre hierarchy at any level $r \ge 1$ is strictly
stronger than this linear program, we can directly use the integrality gap from [CKNZ04] to obtain our
regret bound. More precisely, [CKNZ04] shows that the integrality gap for the separation costs is $O(\log \kappa)$,
while the assignment costs are exact and have no integrality gap (gap of 1). The overall integrality gap is
then stated as $O(\log \kappa)$ by combining the two parts. However, for our purposes, it is important that the
assignment costs are exact. To invoke the integrality gap result, we write the objective in (17) as (the negative
of) the total cost (20), with the linear part involving $Y$ incorporated into the assignment costs (per
item). Since the values of $Y$ could be negative, we may only appeal to Theorem 5 if there is no gap for the
assignment costs. This is the case for the proof in [CKNZ04], and we conclude that

$$\mathrm{gap}(r) = O(\log \kappa).$$

Theorem 5 then ensures a regret bound given by the Rademacher complexity of the class, increased multiplicatively
by $O(\log \kappa)$.

The examples presented thus far extend to the case of having side information $x_t$, as long as the set $\mathcal{F}_{x_{1:V}}$
can be represented by polynomially many constraints. One concrete example of when this can happen is if
we define

$$\mathcal{F}_{x_{1:V}} = \left\{ f \in [2]^V : \inf_{w \in B_\infty} \sum_{v \in [V]} \bigl| w^\top x_v - (2 f_v - 3) \bigr| + \rho \sum_{(u,v) \in E} W_{u,v}\, d(f_u, f_v) \le K \right\}.$$

The above class encodes a prior belief that the well-performing (in terms of prediction) labelings are
close to those given by some linear function of the side information.

Let us also mention an example where the constraints are defined in terms of the side information.
Consider the above metric labeling problem, and imagine that the assignment cost $d_1(t, \cdot)$ is chosen according
to $x_t$. We may use such flexibility to provide a prior on the assignment of labels to individuals depending
on the information about them.

We remark that the metric labeling objective subsumes Multiway Cut, among other problems. The 
objective also subsumes the energy function of the Ising model. Constraints based on Multiway Cut and Ising 
model appear to be well-suited for modeling global information dispersed throughout the graph. Furthermore, 
as soon as a better integrality gap is proved for a particular instance of a problem (such as, say, a known 
constant integrality gap for metric labeling on planar graphs), it can be immediately used in the regret bound 
without changing the algorithm. 

9 A Lower Bound 

In this short section we prove a lower bound, showing that the algorithms we developed are near-optimal
in terms of regret guarantees. We first consider the case of binary classification with $\kappa = 2$. We show a
simple lower bound on the expected regret in terms of the Rademacher complexity of the constrained set
of predictors. Next, we use the binary case lower bound to obtain a lower bound for the general case when
$\kappa > 2$. We show that the worst-case regret of any prediction strategy is lower bounded by $1/\kappa$ times the
Rademacher complexity. In summary, as long as the integrality gap is of constant order, and the Rademacher
complexity of the class only depends polynomially on $K$, the upper bounds we obtained are optimal up to
a constant factor indicated by the gap.

Proposition 6. For any $K$, any generating process that produces $(x_t, \mathcal{C}_t)_{t=1}^{V}$, and any class of benchmark
predictors $\mathcal{F} \subseteq [2]^{\mathcal{X}}$, there exists a strategy of labelings such that the following bound on the expected regret
holds for any prediction algorithm:

$$\mathbb{E}[\mathrm{Reg}] \ge \frac{1}{2}\, \mathbb{E}_{(\mathcal{C},x)_{1:V}}\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])$$

Corollary 7. For any $K$, any generating process that produces $(x_t, \mathcal{C}_t)_{t=1}^{V}$, and any class of benchmark
predictors $\mathcal{F} \subseteq [\kappa]^{\mathcal{X}}$, there exists a strategy of labelings such that the following bound on the expected regret
holds for any prediction algorithm:

$$\mathbb{E}[\mathrm{Reg}] \ge \frac{1}{\kappa}\, \mathbb{E}_{(\mathcal{C},x)_{1:V}}\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])$$

A Proofs


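Proof of Lemma 1. At time $t$, given $\{\mathcal{C}_s, x_s\}_{s=1}^{t}$, let $q_t$ be a strategy defined by first drawing the random
variables $I_{t+1:V} = (\mathcal{C}, x)_{t+1:V}$ and then solving for the randomized strategy $\tilde{q}_t$ defined in (6). We shall first
prove the following inequality for any $t \in [V]$:

$$\mathbb{E}_{\mathcal{C}_t, x_t} \left[ \sup_{y_t} \left\{ \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathbb{E}_{I_{t+1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t})\bigr] \right\} \right] \le \mathbb{E}_{I_{t:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t-1})\bigr] \qquad (21)$$

Here, the random variables $(\mathcal{C}_s, x_s)$ follow the distribution given in (3).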

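We will prove the above statement for any $t \in [V]$ by starting from the base case $t = V$ and then working
backward inductively. To this end, consider the very last step. Given $\mathcal{C}_{1:V}, x_{1:V}, y_{1:V-1}$,

$$\begin{aligned}
\sup_{y_V} \left\{ \mathbb{E}_{\hat{y}_V \sim q_V}[\ell(\hat{y}_V, y_V)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V}) \right\}
&= \inf_{q_V} \sup_{y_V} \left\{ \mathbb{E}_{\hat{y}_V \sim q_V}[\ell(\hat{y}_V, y_V)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V}) \right\} \\
&\le \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V-1})
\end{aligned}$$

where the last inequality is by the admissibility condition of the relaxation. Hence, we conclude that

$$\mathbb{E}_{\mathcal{C}_V, x_V} \left[ \sup_{y_V} \left\{ \mathbb{E}_{\hat{y}_V \sim q_V}[\ell(\hat{y}_V, y_V)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V}) \right\} \right] \le \mathbb{E}_{\mathcal{C}_V, x_V}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V-1})\bigr].$$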

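This proves the base case. Now assume the statement holds for any $\tau > t$ and let us conclude the statement
for $t$. For the $t$th round, given $\mathcal{C}_{1:t}, x_{1:t}, y_{1:t-1}$,

$$\begin{aligned}
& \sup_{y_t} \left\{ \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathbb{E}_{I_{t+1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t})\bigr] \right\} \\
&= \sup_{y_t} \left\{ \mathbb{E}_{I_{t+1:V}}\bigl[ \mathbb{E}_{\hat{y}_t \sim \tilde{q}_t(I_{t+1:V})}[\ell(\hat{y}_t, y_t)] \bigr] + \mathbb{E}_{I_{t+1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t})\bigr] \right\} \\
&\le \mathbb{E}_{I_{t+1:V}} \left[ \sup_{y_t} \left\{ \mathbb{E}_{\hat{y}_t \sim \tilde{q}_t(I_{t+1:V})}[\ell(\hat{y}_t, y_t)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t}) \right\} \right].
\end{aligned}$$

By the definition of $\tilde{q}_t$, the above expression is equal to

$$\mathbb{E}_{I_{t+1:V}} \left[ \inf_{q_t} \sup_{y_t} \left\{ \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t}) \right\} \right] \le \mathbb{E}_{I_{t+1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t-1})\bigr].$$

Thus we can conclude that

$$\mathbb{E}_{\mathcal{C}_t, x_t} \left[ \sup_{y_t} \left\{ \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathbb{E}_{I_{t+1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t})\bigr] \right\} \right] \le \mathbb{E}_{I_{t:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:t-1})\bigr].$$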


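This proves (21) via the inductive argument. To conclude the proof of the lemma, note that by the dominance
condition,

$$\sum_{t=1}^{V} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \ell(f(x_t), y_t) \le \sum_{t=1}^{V} \ell(\hat{y}_t, y_t) + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V}).$$

Using the above inequality and Eq. (21), we conclude that

$$\begin{aligned}
\mathbb{E}_{I_{1:V}} \left[ \sum_{t=1}^{V} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \ell(f(x_t), y_t) \right]
&\le \mathbb{E}_{I_{1:V}} \left[ \sum_{t=1}^{V} \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V}) \right] \\
&\le \mathbb{E}_{I_{1:V-1}} \left[ \sum_{t=1}^{V-1} \mathbb{E}_{\hat{y}_t \sim q_t}[\ell(\hat{y}_t, y_t)] + \mathbb{E}_{\mathcal{C}_V, x_V}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid y_{1:V-1})\bigr] \right] \\
&\le \cdots \le \mathbb{E}_{I_{1:V}}\bigl[\mathrm{Rel}(\mathcal{F}_K[I_{1:V}] \mid \cdot)\bigr].
\end{aligned}$$

This concludes the proof of the lemma.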

Proof of Lemma 2. The initial dominance condition is satisfied, since 

v v 

Relv (£? bi:vO = sup M StVs - V > sup M StVs - V 
Me7W“[ 

V V 

= SU P ^{ M s, Vb - 1) = - inf y]i{/(x s ) ± y s }. 
M&Mg , /eS ' 


□ 


S —1 


S =1 
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Next we show the recursive admissibility condition for the randomized strategy provided in the lemma. To 
this end note that, 


max {E yt ^ t [£(y t , y t )] + Rel (Q \y 1:t )} 

= max jl -E 9t ^ t [1 {y t = yt}] + E e v sup {V M s>Vs + 2 V V e jtk M jtk 1 - t 

ytelK] { (1=1 j=£i J 

{ ( t V K 

- E E t+ i:v%t~«(6 t+ i ..v)[ 1 {yt = yt}}+^e t+1 ..v sup \ y\M StVs +2 V Vei.kMy 


— E Cf + l:V 


max < -E ?t ~« t( e t+1 . v) [1 {yt = Vt}] + sup <^M s , Hs +2 Y ^ } - (t 

ytelK] \ m ^m \fri 


By the definition of the randomized strategy, the last expression is equal to 

t V K 


E, 


e t+l:V 


inf max -E St „ qt [1 {y t = y t }] + sup \ V M s ,„, + 2 Y Y tj,kMj,k > - (i - 

9t6A([«]) W 6[«] | M<=M\fr{ j=ZUt=l 


~ (t ~ 1) 
- 1 ) 


1 ) 


Using the minimax theorem, we can swap the infimum and supremum, and obtain equality to 

t V K 


E, 


=t + l: V 


sup min E yt ~ pt 

p t 6A([K])S»te[K] 


M 6 M 


= E Ct+liV sup I - max E yt ~ pt [1 {y t = y t }] + E. 


1 {y t = yt} + _sup AT Ms,y s + 2 EE Sj,kMj,k Z (t 1) 

j=t+lfc=l JJ_ 

ft V K 

sup lj2 Ms ’ y ° + 2 E E e -h fcM h 


j=t +1 k= 1 


Since 


PiSA(W) [ sheM 

max E [1 {y t = yt}] = max p t [i] > max p t [i] I Y M t ,j ) > YPt[i\M t ,i = E [M t y ,], 
yte[K]yt~pt *e[«] < 6 [k] \j J i J v't~Pt t 

the previous expression can be upper bounded by 

{ t-l V K 

E M S ,y s + ( Mt,y t E y' t ~ pt ) + 2 EE f-j. k 1 {'j . k 

s=1 ...... 

which is upper bounded by Jensen’s inequality by 


- it - 1 ) 


V sup < 

^yt^pt 

sup < 

Pi6A(W) 


MeM \ 


j=t+1 k= 1 





r 


Eet+i:V SU P < 

^y t ,yt~pt 

1 

sup 

Pi6A([«]) 

MeM \ 


j = t+1 k =1 


-(t-l). 


Since in above yt. and y' t are identically distributed, we can introduce an independent Rademacher random variable 
5t . The last expression is equal to 


Ee t+l: V SU P { E S t ,y’ 


Pt6A(H) 


,y t ,yt~pt 


< E £f+1:V sup < 

Es t 

sup < 

yt,y' t e[K] 


mgm \ 


t -1 v K 

SUp 'I y , Ms,y s + St(Mt : y t Mt,y') + 2 y y fj.k M;j,k 
MeM \7Ti 


j=t-\-1 k=1 


-(t-l) 
-(t-l) 


< E e t+1;V sup < 

E «t 

sup < 

yt£[n] 


MeM \ 


E Ms,y s + 2 5t.Mt, yt + 2 Yt E Cj,kMj,h 

j=t +1 fc=1 


-(t-l) 
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Now let e\* G {±1} K be defined as 1 on coordinate yt and independent Rademacher variables on the rest. For any 
j ^ yi, E [ejT] = 0 and €^ t yt = 1 and so the preceding expression is equal to 


sup < 


sup \ 

yt e [«] 


mgm \ 


1 i,k 


< E *t+l:V SU P 1 E 5 t 


Wt€[K] 


s=l k= 1 

' t-1 


j=t+1 k=1 


V K 

sup ^ ^ ^ M s ,y s -t- 2 ^ ^ H - 2 EE e j,kMjy : 

M£M \ i_i j=t+lk=l 


= E, 


t + l:V 'I Ee t 


= E e 


=i fc=i 

t— 1 K V K 

sup ,<E Ms,y s + 2 M tl k£t.,k + 2 EE e j,k^j, 

=1 k=1 j=t+lk=l 

V K 

sup l V M Sj y s + 2 V E ej'kMj'k 

MeM I S = 1 j=t fc=l 


-(t-1) 


M£M 

t-1 


-(t-1) 


- (t-1) = Rel (Q |yi:t-i) 


Thus we have shown admissibility of the relaxation and demonstrated that the randomized strategy for the 
forecaster is given by the one in the lemma. □ 


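Proof of Proposition 6. To prove the lower bound, we simply consider an adversary who picks nodes in
the fixed sorted order and at each time step draws $\mathcal{C}_t, x_t$ from the known generating process and finally draws
$y_t \sim \mathrm{Unif}([\kappa])$ with $\kappa = 2$. Since $y_t$ is drawn independently and uniformly at random on every round, irrespective
of how the forecaster picks $\hat{y}_t$, the expected loss of the forecaster is $\mathbb{E}[\mathbf{1}\{\hat{y}_t \ne y_t\}] = 1/\kappa = 1/2$. Thus we get the
following lower bound on the expected regret:

$$\begin{aligned}
\mathbb{E}[\mathrm{Reg}] &\ge \mathbb{E}_{(\mathcal{C}_t, x_t)_{t=1}^{V}} \mathbb{E}_{y_{1:V} \sim \mathrm{Unif}([2])} \left[ V/2 - \inf_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \mathbf{1}\{f(x_t) \ne y_t\} \right] \\
&= \mathbb{E}_{I_{1:V}} \mathbb{E}_{y_{1:V} \sim \mathrm{Unif}([2])} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \Bigl( \mathbf{1}\{f(x_t) = y_t\} - \tfrac{1}{2} \Bigr) \right].
\end{aligned}$$

Now for the uniform distribution over the $y_t$’s, since $\mathbf{1}\{f(x_t) = y_t\} - \tfrac{1}{2}$ and $\tfrac{1}{2} - \mathbf{1}\{f(x_t) = y_t\}$ are identically
distributed, we see that

$$\begin{aligned}
\mathbb{E}[\mathrm{Reg}] &\ge \mathbb{E}_{I_{1:V}} \mathbb{E}_{y_{1:V} \sim \mathrm{Unif}([2])} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \epsilon_t \Bigl( \mathbf{1}\{f(x_t) = y_t\} - \tfrac{1}{2} \Bigr) \right] \\
&= \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \mathbb{E}_{y_{1:V} \sim \mathrm{Unif}([2])} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \epsilon_{t,y_t} \mathbf{1}\{f(x_t) = y_t\} \right] \\
&\ge \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \mathbb{E}_{y_t \sim \mathrm{Unif}([2])} \bigl[ \epsilon_{t,y_t} \mathbf{1}\{f(x_t) = y_t\} \bigr] \right] \\
&= \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \frac{1}{2} \sum_{k=1}^{2} \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right] \\
&= \frac{1}{2}\, \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \sum_{k=1}^{2} \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right]
\end{aligned}$$

where the last line is because for any $f$ and any instance $x_t$ only one of $\mathbf{1}\{f(x_t) = 1\}$ or $\mathbf{1}\{f(x_t) = 2\}$ will
be 1 and the other is 0. □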

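Proof of Corollary 7. This corollary follows by a simple modification of Proposition 6. We shall
assume here that $\kappa$ is even. The modification is as follows: the adversary first picks uniformly
at random a number $R$ from $[\kappa/2]$. Next, the adversary uses exactly the lower bound construction of
Proposition 6, except that instead of picking $y_t \sim \mathrm{Unif}([2])$ the adversary picks $y_t \sim \mathrm{Unif}(\{R, R + \kappa/2\})$.
Notice that, given the draw of $R$, this is exactly the binary case with labels $R$ and $R + \kappa/2$. Hence we can
use the proposition to bound the expected regret as follows:

$$\begin{aligned}
\mathbb{E}[\mathrm{Reg}] &\ge \frac{1}{2}\, \mathbb{E}_{R \sim \mathrm{Unif}([\kappa/2])} \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \sum_{k \in \{R, R + \kappa/2\}} \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right] \\
&= \frac{1}{2}\, \mathbb{E}_{R \sim \mathrm{Unif}([\kappa/2])} \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \mathbf{1}\{k \in \{R, R + \kappa/2\}\}\, \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right] \\
&\ge \frac{1}{2}\, \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \mathbb{E}_{R \sim \mathrm{Unif}([\kappa/2])} \bigl[ \mathbf{1}\{k \in \{R, R + \kappa/2\}\} \bigr]\, \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right] \\
&= \frac{1}{\kappa}\, \mathbb{E}_{I_{1:V}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}_K[I_{1:V}]} \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{t,k} \mathbf{1}\{f(x_t) = k\} \right].
\end{aligned}$$

□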


Proof of Lemma 4. The proof closely follows the analogous proof of Lemma 2. Note that we deal directly
with the relaxed set of Lasserre’s level $r$. To make the notation simpler, given a Lasserre vector set at level
$r$, say $U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})$, let $M^U_{j,k} = \|U_{(\{j\},k)}\|^2$, and for each constraint $c$ we use the
notation

$$c(U) = \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha)\, \|U_{(S_c,\alpha)}\|^2.$$

Now let us proceed to verify that the initial dominance condition is satisfied by the relaxation. Note that

$$\begin{aligned}
-\inf_{f \in \mathcal{G}} \sum_{s=1}^{V} \mathbf{1}\{f(x_s) \ne y_s\}
&\le -\inf_{f \in \mathcal{G}} \left\{ \sum_{s=1}^{V} \mathbf{1}\{f(x_s) \ne y_s\} + \lambda \sum_{c \in \mathcal{C}_{1:V}} c(f) \right\} + \lambda K \\
&\le -\inf_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ \sum_{s=1}^{V} \bigl( 1 - M^U_{s,y_s} \bigr) + \lambda \sum_{c \in \mathcal{C}_{1:V}} c(U) \right\} + \lambda K,
\end{aligned}$$

where the first inequality holds because functions in $\mathcal{G} = \mathcal{F}_K[I_{1:V}]$ are required to keep the sum over
unsatisfied constraints below $K$ by definition. The second inequality holds because the Lasserre solution
is a relaxation of $\mathcal{G}$ and hence larger than the solution within $\mathcal{G}$. Let us check the recursive admissibility
condition. To show that the proposed randomized strategy is admissible, we prove the recursive admissibility
condition using this strategy directly:

max (E y t ^ t [£{y t , y t )] + Rel (Q \y 1:t )} 

yt e M 


= max < 1 - Ey t ^ t [1 {y t =yt}]+ E e 


Wt€[«] 

-t + \K 


sup 


UeC&s(r,J r x 1:V ) I S=1 


E M ^+ 2 E E e ** M &- A E c ( u ) 


j=t+ik =i 


cG^i 


= max < — E 


E 


»t€[«] I et+i-.vyt~qt(et+i-.v) 
- {t - 1) + A K 


[1 {y t =yt}+ E 


sup 


£ t+l:V U6£as(r,J a . 1:V ) “ 


E + 2 E E - A E c ( u ) 


j=t+ifc=i 


cG^l; y 


< E 

e t+l: V 


max < — 


E 


Wt6[«] 1 !/£~<2t(et-|-l : v) 

- (t - 1) + AA 


[1 {y t = yt}] + sup 


U6£as(r,^ 1;l ,) I S=1 


E<..+ 2 E E e ** M &- A E c ( u ) 


j=t+i k =i 


cG^iiV 
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by the definition of the strategy, 


= E 

«t+l:V 


i t V K 

E M ^vs + 2 E E - A E c ( u ) 

7^i jt?+ it^i ce V 1:V 


|^t€A([«]) yte[n] 

- (t - 1) + A K 

Using the minimax theorem, the above expression is equal to 


= E e 


sup min E yt ^ pt 
P t€A([ K ]) SteM 


-i {y t = yt}+ sup . E aC. + 2 E E e ^ M " A E c ( u ) 


UeCas(r,T Xl:V ) I g=1 


j=t+1k=1 


cG^l; V 


- (t - 1) + A K 


= E sup < - max E [1 { y t = y t }] + E wt ^ pt 

*t+l:V Pt eA([«]) »t€[lt]»t~Pt 


{ t V K 

E A C+ 2 E E e ** M &- A E c ( u ) 

s=i j=t+ife=i ce'ifi.v 


- (t - 1) + A K 

Once again, by the constraint in the SDP that for any f, X]fc=i ||U({t},fc)|| 2 = X]fe=i = 1 we can conclude that 
max E [1 {yt = yt}] = max pt[i\ > max pt[i\ ( J > = E [M t u d. 

* 6 [k] *€[k] * 


Hence, we conclude that, 


max {Ey t ^ t [£{y t ,yt)] + Rel (Q |j/i :t )} 


< E sup 


E 


£ t+i:v Pt eA([«]) ^yt~pt 

- (t - 1) + A K 


sup E <vs + (<y t - E y' t ~p t KjI ) + 2 E E O / '/,'i - A E C ( U ) 

Ue£as(r,J r CCl;V ) I S=1 L J 


j=t+1 k =1 


cG^iiV 


using Jensen’s inequality to pull out the expectation, 


— |_i;v 


sup 

ptGA([«]) 


E„ 


sup 


/vm u „ 




Ue£as(r-,J a!l . y ) ^ s _i 


+ (ME - M",) + 2 E E o / -A E c ( u ) 

i=t+i fc=i ce^i-v 


- (t - 1) + XK 


since yt and y' t are identically distributed, we can introduce Rademacher random variable 5t, 


Ee t+1 . v sup < E 

K6A(W) I St,v' t ,vt~pt 

- (t - 1) + A K 


sup . E <,+mC-<i>+2 E E £ ^- a E c ( u ) 


UG£as(r,J a!l;V ) I S=1 


j=t +1 k =1 


CG^l;' 


< E sup < Ej t 
e, + LV m.slcW ^ 

- (f - 1) + A K 


sup , E MZ S + 6t(M? yt - M t u y ,) + 2 E E - A E C (U) 


UG£as (r,F Xl;V ) I S=1 


j=t+1 k= 1 


CG^l; V 


< Ee t+1;V sup < E s t 
2 /te[<t] I 

- (f - 1 ) + A K 


sup , E ME+2«5 t ME+2 E E e ^ M ^- A E C (U) 


Ue£as(r,J r £Cl:V ) I S=1 


j=t+l k =1 


CG^l; 
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Let Cj* G {±1} K be defined as 1 on coordinate yt and independent Rademacher variables on the rest. For any j ^ yt, 
E [efh] = 0 and e\* yt = 1 and so, 


< E et+liV sup < E 4t 

Wt€[n] I 


sup 


UG£as(r,J' a;i:V ) I fl=1 


E + 2 E 5tM ^ E K‘J + 2 E E - A E c ( u ) 


j=t +i fc=i 


CG^l; V 


— (t — 1) + XK 

<E et+ i :V sup Je 4 yt 
yt 6 W I 

- (t - 1) + A K 


sup 


UG£as (r,Tx 1:V ) \ S=1 


IE M «v. + 2 E + 2 E E - A E c ( u ) 


3=t+1 fc=1 




— ®b t+1: y \ 


SU P ^ E + 2 E M t?kCt,k + 2 E E e i,k M j!k - A E c ( u ) 

fc=i i=t+ifc=i 


= E e 


UG£as(r,Ja; 1:V ) I S=1 

- (t - 1) + XK 

( t -1 V K 

sup ^E M E + 2 EE e ^ M ^- A E < u ) 

j=t k =1 cG^i : v 


Ueras(r,J» 1:l ,) I S _1 

= Rel (<J |yi: t -i) 


- (t - 1) + AA 


□ 


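Proof of Theorem 5. We have

$$\begin{aligned}
\mathrm{Rel}(\mathcal{G} \mid \emptyset) &= \lambda^* K + \mathbb{E}_{\epsilon_{1:V}} \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{t,k} \|U_{(\{t\},k)}\|^2 - \lambda^* \sum_{c \in \cup_t \mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\} \\
&\le 2\, \mathbb{E}_{\epsilon_{1:V}} \sup_{U \in \mathcal{L}\mathrm{as}(r, \mathcal{F}_{x_{1:V}})} \left\{ 2 \sum_{t=1}^{V} \sum_{k=1}^{\kappa} \epsilon_{t,k} \|U_{(\{t\},k)}\|^2 - \lambda^* \sum_{c \in \cup_t \mathcal{C}_t} \sum_{\alpha \in [\kappa]^{S_c}} R_c(\alpha) \|U_{(S_c,\alpha)}\|^2 \right\} \\
&= 2 \lambda^* K.
\end{aligned}$$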

Now by definition of gap(r) we conclude that 


Rel (Q |0) < 2 E e 


V K 


sup l 2 E E -53 E •=<«) 


Me/ S 


1:V t t=l k = 1 


gap(r) 


cGUt^t 
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Defining K t = 2®, we get an upper bound, 


V K 


< 2 E eisV 


max 

iez 


sup 


A* 


£ ■=("> 


„ ‘1:V f t= 1 fc=l 

^-i<£ c gu t ^ c(M)<Ki 


< 2 max < E, 


'€l:V 


V « 


sup 

MeT x 1:V 

_E c gu t ^ t c(M)<Ki 




= 2 max< Radv(J r ff i [Xi:\^])— 


gap(r) 


t=i fc=i 


AA-i 


gap(r) 


cGUt^t 


i^i-i 


= 2 max < Rady(Jif i pi : y]) — n -, , 

*€Z [ 2gap(r) 


a; 


<2 maxJ RadybF <■ ic a Jli-yl) - ——— A', 
iez \ 2 gap(r) 


<2 rngcj maxjl.E} Rad vK^Zwl) - ^pgpyif.} • 


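Now let us split the analysis into two cases. First, if $\lambda^* > \dfrac{2\, \mathrm{gap}(r)\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])}{K}$, then

$$\mathrm{Rel}(\mathcal{G} \mid \emptyset) \le 2 \max_{i \in \mathbb{Z}} \left\{ \max\Bigl\{1, \Bigl(\frac{K_i}{K}\Bigr)^{p}\Bigr\}\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]) - \frac{K_i}{K}\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]) \right\} \le 2\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])$$

where the last line is because $p < 1$. Next, let us consider the case when $\lambda^* \le \dfrac{2\, \mathrm{gap}(r)\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}])}{K}$. For this case,
however, note that we already showed that $\mathrm{Rel}(\mathcal{G} \mid \emptyset) \le 2 \lambda^* K$, and so

$$\mathrm{Rel}(\mathcal{G} \mid \emptyset) \le 4\, \mathrm{gap}(r)\, \mathrm{Rad}_V(\mathcal{F}_K[I_{1:V}]).$$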

The first statement follows. The second statement of the Theorem is an immediate consequence of Lemma 1.

□